In This Issue


This issue of Survey Methodology contains the fourth in the annual invited paper series in honour 
of Joseph Waksberg. A brief description of the series and a short biography of Joseph Waksberg were 
given in the June 2001 issue of the journal. I would like to thank the members of the awards selection 
committee for having selected Norman Bradburn as the author of this year’s Waksberg invited paper. 

In his paper entitled “Understanding the Question-Answer Process”, Bradburn traces the history of 
conceptualization of the survey process over the past couple of decades, in which concepts from social 
and cognitive psychology and linguistics have been applied to improving our understanding of this 
process, and cognitive tools and approaches have been adapted for use in formulating survey 
instruments. He presents a conceptual model for the survey interview, and discusses various cognitive 
processes in survey response such as comprehension, retrieval, answer formulation and response. In 
his concluding summary he outlines challenges and priorities for further research in this area. 

In Demnati and Rao, the authors present an approach for obtaining Taylor linearization variance 
estimators that is easier to apply than the usual Taylor linearization approach. The new method leads 
to a unique variance estimator and is applicable in many situations and estimators. The method is 
illustrated for calibration estimators, estimating equations and under two-phase sampling. For 
calibration estimators, the calibration weight is automatically captured in the variance formulae thus 
justifying what is commonly done in practice. Discussions of this paper are provided by Phil Kott, 
Babubhai Shah, and Chris Skinner. 

Isaki, Tsay and Fuller propose a new method of household weighting for the 2000 U.S. Census long 
form, using quadratic programming to ensure that the weighted sums of household and individual 
characteristics match control totals derived either from the Census short form or from the Accuracy 
and Coverage Evaluation (A.C.E.) study. The weights are then rounded to integer values. They 
propose a jackknife procedure for estimation of the variance that incorporates the effects of both 
rounding and the random controls from A.C.E. Results of the proposed weighting procedures are 
compared to the 1990 weighting procedures using the 1990 Census data. 

The theoretical properties of the estimator obtained by reweighting within cells are studied in the article
by da Silva and Opsomer. In contrast with numerous other studies on the subject, which involve a 
response model in which the population units are homogeneous within cells, it is not necessary to 
correctly specify the response model. It is necessary, however, to determine an auxiliary variable that 
is correlated with the response probability. The proposed approach can thus be seen as non-parametric. 
A simulation study explores the properties of the estimator being considered under various scenarios. 
The authors also provide some recommendations on the size and number of reweighting cells. 

Brick, Kalton and Kim deal with the estimation of variance in the presence of hot-deck imputation
within imputation cells for linear estimators. Särndal’s decomposition (1992) and a model for the
variable of interest are used to estimate variance. The originality of the proposed approach comes from
the fact that it conditions not only on the sampled and responding units, but also on the units selected at
the time of imputation. The article also deals with estimation for domains, and a simulation study is
carried out to evaluate the proposed method when certain model assumptions do not hold.

Hidiroglou and Patak study the properties of a number of small area estimators. They classify the 
estimators into two types, Horvitz-Thompson and Hájek, and by the detail of auxiliary information
required. Conditional and unconditional properties of the estimators are investigated both analytically
and in a simulation study. They conclude that the Hájek-type estimators have the best conditional
properties, both in terms of bias and coverage, but these estimators do not have the additive property 
and their weights are domain dependent. 




In their paper, Sverchkov and Pfeffermann develop prediction of finite population totals using a 
model for a variable of interest conditional on the unit not being in the sample (the sample- 
complement distribution) and possibly some covariates. They first describe the sample distribution and 
the sample-complement distribution, and then develop semi-parametric estimation of the sample 
complement model. A resampling procedure is proposed for mean-square error estimation. The 
method is illustrated by examples and it is compared to alternative approaches in a simulation study. 

The article by Grilli and Pratesi considers the problem of parametric estimation for multilevel ordinal
and binary models under informative sampling designs. The authors extend the pseudo
maximum likelihood method to deal with this problem. This method uses the inverse of the inclusion
probabilities at each stage to weight the logarithm of the likelihood function. The properties of the
resulting estimator are tested in a simulation study. The bootstrap method is also used to
obtain a variance estimator.

Rowe and Nguyen explore longitudinal analysis using data from an overlapping panel survey, 
specifically, the Canadian Labour Force Survey. Successive six-month longitudinal panels can be used 
to provide estimates relating to cohorts of people over time, provided that cohort members can be 
identified in each panel. They develop a likelihood function for the longitudinal data observed in each 
six-month window, and show how this can be used to obtain estimates of parameters of interest. They 
then give an illustration of this approach for estimating transition probabilities between employment 
states and validate it by comparing simulated and observed data. 

Finally, in a paper somewhat related to Bradburn’s, Callens and Croux look at individual level and 
municipality level predictors of contact and cooperation in the Belgian Fertility and Family Survey
using multilevel logistic regression models. They discuss some social theory models for contact and 
cooperation that imply an important role for different indicators, and then fit models using data from 
the survey. Their qualitative findings, in particular with respect to socio-economic status (SES) 
indicators, seem to conflict with the results of similar studies in the literature. In this study, SES was 
found to be positively related to cooperation. Some possible explanations of the observed results are 
offered. 
Waksberg Invited Paper Series


Survey Methodology has established an annual invited paper series in honor of Joseph Waksberg, who has 
made many important contributions to survey methodology. Each year, a prominent survey researcher will 
be chosen to author a paper that will review the development and current state of a significant topic in the 
field of survey methodology. The author receives a cash award, made possible through a grant from Westat 
in recognition of Joe Waksberg’s contributions during his many years of association with Westat. The grant 
is administered financially and managed by the American Statistical Association. The author of the paper is 
selected by a four-person committee appointed by Survey Methodology and the American Statistical 
Association. 


The author of the Waksberg paper is announced at the annual Joint Statistical Meeting during the American 
Statistical Association Presidential Address and Awards session. In this session, recipients of awards such as 
Section, Chapter, Continuing Education-Excellence and other co-sponsored awards are congratulated. In 
particular, the Waksberg Award for outstanding contributions in the theory and practice of survey 
methodology is highlighted. Finally, the winner of the Waksberg award appears in the Awards program 
booklet. 


Previous Waksberg Award Winners: 


Gad Nathan (2001) 
Wayne A. Fuller (2002) 
Tim Holt (2003) 


Nominations: 


Nominations of individuals to be considered as authors or suggestions for topics 
should be sent by December 3, 2004 to the chair of the committee, David 
Bellhouse, by e-mail at bellhouse@stats.uwo.ca or by fax at (519) 661-3813.


2004 WAKSBERG INVITED PAPER 
Author: Norman M. Bradburn 


Norman Bradburn is the Tiffany and Margaret Blake Distinguished Service Professor Emeritus in the 
University of Chicago. He has spent most of his career as a survey methodologist at the National Opinion 
Research Center (NORC) at the University of Chicago where he is currently a Senior Fellow. His research 
has concentrated on the study of non-sampling errors in surveys with particular emphasis on the cognitive 
aspects of the survey question/answer process. 
Understanding the Question-Answer Process


NORMAN M. BRADBURN¹


ABSTRACT 


Survey statisticians have long known that the question-answer process is a source of response effects that contribute to non- 
random measurement error. In the past two decades there has been substantial progress toward understanding these sources 
of error by applying concepts from social and cognitive psychology to the study of the question-answer process. This essay 
reviews the development of these approaches, discusses the present state of our knowledge, and suggests some research
priorities for the future.


KEY WORDS: Measurement errors; Response effects; Cognitive psychology; Questionnaire design. 


1. INTRODUCTION 


When I was in graduate school, I was deeply impressed 
by Gordon Allport’s comment to the effect that the best way 
to find out something was to ask a direct question. Later, as 
I began to study and do research on methodological 
problems in sample surveys of human populations, I 
became more convinced of the wisdom of this remark. I
have even formulated it into Bradburn’s Law for Ques-
tionnaires: “Ask what you want to know, not something
else.”

The trouble with this law is that it is extremely difficult to 
put into practice for several reasons. First, it presumes that 
we know what we want to know. Often when we start out to 
construct a questionnaire, we are not sure what we want to 
know and use the questionnaire construction process in an 
iterative fashion to refine our ideas about what we want to 
know. Until we have a clear understanding of what we are 
trying to ask about, there is little hope that we will be able to 
ask meaningful questions. 

Second, even if we know what we want to know, we 
need to understand how people answer questions. The 
complexities of human communication make it difficult to 
construct a single, standardized instrument that will enable
us to ask our questions so that respondents will understand 
them in the way that we intend and that we will understand 
their answers in the way they intend. Belson (1968), who 
has done extensive studies on the comprehension of 
questions by respondents, estimates that even with the best- 
constructed questionnaires, less than half of the sample will 
understand the questions the way the researcher intended. 
He does not present any data on how well the researchers 
understand the responses. 

Even if this estimate is too pessimistic, we are faced with 
a difficult problem of measurement error that comes from 
the question-answer process itself, rather than from sample 
design or survey execution. The existence of this source of
measurement error has been recognized since the beginning 
of scientific surveys, that is, since the development of 
sampling theory and its application to human populations. 
Unlike sampling theory, which rests on firm mathematical 
principles, the understanding of measurement error due to 
the question-answering process has not, until recently, been 
based on the theoretical understanding of human commu- 
nication and cognition. This situation is beginning to 
change. 

In the past two decades there has been substantial 
progress in the conceptualization of the survey interview 
applying concepts from social and cognitive psychology 
(Jabine, Straf, Tanur and Tourangeau 1984, Sudman and 
Bradburn 1974, Sudman, Bradburn and Schwarz 1996, 
Tourangeau, Rips and Rasinski 2000). In this essay I will 
review briefly the development of these approaches, discuss 
the present state of our knowledge regarding the question- 
answer process, and suggest some research priorities for the 
future. 


Some History 


The collaboration between cognitively oriented psycho- 
logists and survey researchers began about 25 years ago. 
Like many innovations it had many progenitors and seemed 
to spring up from several independent sources. One of the 
earliest, if not the earliest instance, was a seminar held in 
1978 by the British Social Science Research Council and 
the Royal Statistical Society on problems in the collection 
and interpretation of recall data in social surveys. Parti- 
cularly noteworthy was the participation of the Cambridge 
cognitive psychologist Alan Baddeley whose paper, “The 
Limitations of Human Memory: Implications for the Design 
of Retrospective Surveys,” is perhaps the first paper by a 
psychologist interested in memory directly related to survey 
design (Baddeley 1979). 


¹ Norman M. Bradburn, National Opinion Research Center, University of Chicago.


Two important events occurred in the United States in 
1980. The first was a workshop convened by the Bureau of 
Social Science Research in connection with its work in the 
redesign of the National Crime Victimization Survey. This 
workshop brought together cognitive scientists and survey 
statisticians and methodologists to discuss what contribu-
tions cognitive scientists could make to understanding
response errors in behavioral reports (Biderman 1980). One 
of the results of this conference was to stimulate some of the 
cognitive psychologists who participated to begin to study 
problems in survey questions in a laboratory setting. One of 
the earliest of such papers was “Since the eruption of Mt. St. 
Helens has anyone beaten you up? Improving the accuracy 
of retrospective reports with landmark events” (Loftus and
Marburger 1985) which demonstrated experimentally the 
value of using landmark events to improve the quality of 
dating events in survey reports. 

The second event was the establishment of a panel on the 
measurement of subjective phenomena by the Committee 
on National Statistics. This panel produced two large 
volumes that reviewed a considerable amount of research on 
response effects involved in the measurement of subjective 
phenomena. It complemented the work that had been done 
by the earlier seminars on measuring behavior or more 
“objective” phenomena (Turner and Martin 1982).

A big stimulus came in 1983 when the Committee on 
National Statistics with funding from NSF organized a 6- 
day seminar in St. Michaels, Maryland on Cognitive 
Aspects of Survey Methodology. Two papers, “Potential 
contributions of cognitive research to survey questionnaire 
design” (Bradburn and Danis 1984) and “Cognitive science 
and survey methods,” (Tourangeau 1984) reviewed how 
new developments in cognitive psychology could contribute 
to survey methodology and how developments in survey 
methodology could contribute to the further development of 
cognitive psychology. The conference was extraordinarily 
fruitful and led to a whole new field of research in survey 
methodology both as applied to objective and subjective 
phenomena. The results of this conference were published 
in Jabine et al. (1984). 

The final instance of independent work that may be 
thought of as a progenitor of this field was a conference
organized by Norbert Schwarz and his associates in 
Germany. Perhaps the most influential paper from this 
conference was the model proposed by Strack and Martin 
(1987) “Thinking, judging and communicating: A process 
account of context effects in attitude surveys.” The results of 
the conference are published in Hippler, Schwarz and
Sudman, Social Information Processing and Survey
Methodology (1987).

In the ensuing years, there has been a stream of research 
that has refined and elaborated the research agenda that 
came from these early seminars. Some of the work
sponsored by the Social Science Research Council is 
published in “Questions about questions: Inquiries into the 
cognitive bases of surveys” (Tanur 1992). Subsequent 
research has been updated in a series of volumes edited by 
Schwarz and Sudman (1992, 1994, 1996). 


A Conceptual Approach to the Survey Interview 


A survey interview is a structured social interaction 
between two people who play distinctive roles, the interviewer
and the respondent. It has been described as a
“conversation with a purpose” (Bingham and Moore 1934). 
The purpose, to put it succinctly, is to get a series of 
questions answered. In scientific surveys, these questions 
are usually embodied in a structured questionnaire designed 
by a third party, the researcher. It is this type of survey 
activity that I will consider, although the analysis could be 
extended to other, less structured interviews. 

Like all social interactions among people from the same 
culture, there are implicit rules that influence the way the 
participants behave. Some of these are general and apply to 
all social interactions between social equals; some are 
general to the peculiar type of interaction we call the survey 
interview; some are general to this survey; and some are 
idiosyncratic and apply to only this particular interview. 
Thus we think of these rules as hierarchically organized 
from the most general, which apply to all survey interviews, 
to the particular rules involved in a particular interview. 

At the most general level the interaction is governed by 
the rules for voluntary interactions between strangers. The 
interaction is initiated by one party, the interviewer, who 
must establish the nature of the encounter. The important 
elements that must be established are: 1) that it is non-
threatening, that is, the interviewer is not going to do any
harm to the respondents; 2) the purpose of the encounter;
and 3) the costs and benefits to the respondents if
they agree to participate in the interview. The interaction is
thus viewed as neutral, purposive, and worthwhile. As with 
any structured social interaction, it is governed by the norms 
related to such interactions. 

What are the norms that are important for the interview? 
The first is mutual respect for individuals, particularly the 
privacy of the respondents. This principle has become an 
important issue regarding the protection of research parti- 
cipants because of a number of instances in bio-medical 
research where the voluntary nature of participation was not 
made clear. For high-risk research written consent to 
participate is now required. In the survey interview, 
however, the context of the request for an interview makes it 
easy for respondents to refuse if they do not wish to 
participate and written consent is superfluous. Asking for 
written consent may actually raise suspicion that the 
interviewer has not been truthful about the purpose of the
interview because written consent is not normally part of a 
conversation between strangers who have established that 
the interaction is non-threatening. 

A second important norm is truthfulness. It is part of the 
role obligation of both parties to be truthful. For the inter- 
viewer, this means telling the respondent pertinent facts 
about the purpose of the interview, what is required of the 
respondents, e.g., how much time it will take, whether they 
will need to consult records, whether the questions may be 
sensitive, etc. and to answer any questions the respondents 
might ask. If providing some information at the beginning 
of the interview might bias responses, such as who the 
sponsor of the research is, the information can be given at 
the end of the interview. 

The purpose of the interview is to obtain the information 
required by the research. The interviewer’s role is to get the 
desired information and the questionnaire is the principal 
instrument for accomplishing this task. A well-designed 
questionnaire makes the interviewer’s job easier and 
minimizes the need for the interviewer to have to answer 
questions about the meaning of questions in the question- 
naire. While interviewers need to be trained about the 
purpose of questions and their meaning, interviewers may 
become a source of uncontrolled variance if they have to 
interpret questions for many respondents. Interviewers need 
to be alert to cues that respondents are misunderstanding 
questions and to act to correct them. The need for many 
interventions by interviewers indicates a bad questionnaire. 

If respondents accept the role and agree to participate in 
the interview, they have the obligation, under the norm of 
truthfulness, to answer the questions as accurately and 
completely as possible. This norm, however, may conflict 
with the general desire of individuals to be well thought of 
and to present themselves in a favorable light. In many 
surveys, we ask questions about potentially embarrassing, 
sensitive or even illegal behavior or unpopular attitudes. 
The interviewer and the questionnaire both play an 
important role in minimizing this conflict and reinforce the 
norm of truthfulness. The empirical evidence, however, 
suggests that even with the best trained interviewers and the 
best techniques of questionnaire design, it is rarely possible 
to prevent some overreporting of socially desirable behavior 
and attitudes or underreporting of undesirable attitudes and 
behavior (see Bradburn, Sudman and Associates 1979;
Wentland and Smith 1993). 

Survey data are collected under a strong norm of confi- 
dentiality. The norm is so strong that even if it is not made 
explicit, respondents expect that information from inter- 
views that have the form of scientific surveys, such as 
public opinion polls or employee attitude surveys, will not 
be identified with them. Violations of this norm such as 
occur with “sugging” (selling under the guise of a survey) or
“frugging” (fund raising under the guise of a survey) 
threaten to erode public confidence in surveys and contri- 
bute to the increase in rates of refusal to participate. Unless 
the data are collected under “shield laws” or certificates of 
confidentiality that have the force of law, confidentiality 
promises, however, can be compromised by law enforce- 
ment activities. 

Linguists have also noted that there are basic shared 
assumptions underlying conversations that facilitate the 
interactions. These have been systematically described by 
Grice and are referred to as Gricean rules (Grice 1975, see
also Sudman et al. 1996 for their application in surveys). 
According to Grice, conversations are based on a principle 
of “cooperativeness” which is embodied in four maxims. 
The maxim of quality enjoins speakers to be truthful and not 
to say things that they lack evidence for. The maxim of 
relation indicates that the utterances are relevant to the topic 
of the ongoing conversation. The maxim of quantity 
requires that speakers not repeat themselves and make the 
contributions to the conversation as informative as possible. 
The maxim of manner requires that the speakers be as clear 
as possible in their meaning. Thus, according to Grice, 
speakers are expected to be truthful, relevant, informative 
and clear. 

These maxims apply equally to informal conversations 
and to interviews that have the form of a special type of 
conversation. Thus the questions asked by the interviewer
are interpreted within the same framework, that is, both
questions and introductory material to questions are expected
to be relevant to the topic, informative and clear.
Violations of these maxims can lead to confusion on the part 
of respondents and produce response effects that are well 
documented. For example, violations of the maxim of rele- 
vance when questions are obscure (see for example, 
Schuman and Presser 1981) or deliberately about fictitious 
issues (Bishop, Oldendick and Tuchfarber 1986) lead to 
respondents trying to make sense of the question by looking 
to contextual cues about the meaning of the question. This 
produces what appears to be an erroneous response when 
viewed from the perspective of the researcher who does not 
understand the conversational assumptions of the 
respondents. 

One of the most well documented order effects in 
surveys occurs when questions of differing levels of 
specificity occur together. When one question is general, 
e.g., “Taking all things together, how happy are you these
days?” and the other is specific, e.g., “How happy is your
marriage?”, responses to the general question are affected
by the order of the questions, while responses to the more 
specific question are not. The effect appears to be the result 
of the workings of the maxim of relevance. When the 
general question comes first it is interpreted as intended, that
is, respondents should include all aspects of their lives in 
making the judgment about their happiness. When the 
general question comes second after the specific question 
about marriage happiness, the maxim of relevance suggests 
that respondents should exclude from consideration their 
marriages because they have already reported on them. 
Thus, even though the question literally asks about “all
things together”, it is interpreted to mean “all things except
those we have already asked about.” It is only those things 
that have not been asked about that are still relevant. 

What happens if the norms outlined above are not 
accepted in the interview either because the respondent 
rejects or redefines the role of respondent or does not 
observe the maxims of conversation? Of course the easiest 
form of rejection of the role of respondent is to refuse the 
interview altogether. Sometimes, however, a person 
sampled becomes a “reluctant respondent”, that is, they
may feel pressured to participate in the study because of
follow-up procedures, because they do not like to refuse a 
strong request from another person or for some other reason. 
In such cases they may care less about being a good 
respondent than just getting the interview finished. Thus 
they may take less time to think about questions, make less 
effort to recall information requested, or be less interested 
in a truthful answer than a “don’t know” or even a false 
answer. Interviewers have told me that they often feel that 
the responses given by those that they have convinced to 
participate in an interview after many attempts at refusal 
conversion are less valid than those of respondents who participate more
willingly. Extra efforts to obtain high completion rates may
in fact produce poorer data.

Respondents also may misunderstand the nature of the 
survey interview, simply want to convert it into a social 
conversation, or not be very skilled conversationalists, that
is, not abide by the Gricean maxims and thus engage in an
“inefficient” conversation. Such conversations are charac- 
terized by frequent asides or changes of topic, comments on 
topics of little or no relevance to the question at hand, 
relating personal anecdotes that may be triggered by some 
aspect of the question, or simple repetition of comments. In 
such cases the interviewer must politely but firmly teach the 
respondent the rules for the conversation and guide the 
respondent to keep focused on the questions in the inter- 
view. Skilled interviewers become experts in steering the 
conversation and, by selective reinforcement, shaping the 
respondents’ behavior to follow the Gricean maxims.

In summary, interviews take place in social contexts that 
have a structure governed by socially shared expectations 
and norms. These norms may differ from society to society 
and perhaps even within subcultures in the same society, but 
they have powerful effects on the way interviews are 
conducted and the way questions are interpreted. Violations
of the expectations or norms may lead to “effects” that may 
be interpreted as error from the perspective of the 
researcher. If these norms and expectations are understood, 
they can be used to avoid problems or to mitigate the 
effects. 

Data could also be obtained from interviewers about how 
much the interview deviated from the model outlined above. 
Although little research has been done assessing the quality 
of interviews from this point of view, a fruitful area for 
future research could be to investigate the decline in validity 
of data as the conditions of the interview increasingly 
deviate from the ideal model. 


Cognitive Processes in Survey Response 


Answering questions in a survey involves considerable 
cognitive work on the part of respondents. Much of what 
underlies recent advances in understanding survey response 
processes derives from the application of models of infor- 
mation processing to the question-answering process. While 
there is still much work to be done before we have complete 
and detailed understanding of how the brain processes infor- 
mation, there is sufficient agreement about the general 
approach to serve as the basis for a better understanding of 
the response process. 

The mind is conceptualized as a large information 
processing system composed of a series of component 
systems. The physical sensations of sound and sight enter 
the system in the sensory register. The sensory register has 
capacity limitations so that only a portion of the information 
is transferred to short-term memory. Attention plays a large 
role in determining what is brought into short-term memory. 
Attention is a function of an executive monitor that enables 
and controls the information processing system much the 
way that programs enable what computers do. The execu- 
tive system controls the entire system through goals and 
plans that are organized into priorities for action. 

The storehouse of the system is the long-term memory 
system that has a very large capacity. Working memory 
refers to the system in which active thinking takes place. 
The activity here draws on short-term memory and 
retrievals from long-term memory. Short-term memory has 
limited capacity but rapid access, while long-term memory 
has large capacity but is relatively slow in access. Long- 
term memory appears to have two rather distinct sub- 
systems, semantic memory and episodic memory, although 
this distinction is not universally agreed upon. Semantic 
memory refers to memory associated with vocabulary, 
language structure, rules and abstract knowledge, while 
episodic memory refers to memory for events that took 
place in time and space. 
Information is represented as a list of features or concepts
that are linked together in networks. Information is stored in 
memory in structures that are hierarchically organized with 
more general concepts being higher in the structure than 
more discrete instances of the concept or distinct features. 
The term “schema” is sometimes used to refer to larger, 
more complex shared and/or overlearned structures that 
organize our thoughts on familiar topics and may be 
retrieved as a whole rather than as individual parts. 

Language is the medium through which information is 
primarily communicated and thus information, to be 
available for communication, must be associated with a 
linguistic code. The exact relationship between language 
and thought and whether or not all thoughts have verbal 
representation are still subjects of debate. It is clear, how- 
ever, that meaning is encoded somehow in language and 
these codes play an important role in the acquisition, storage 
and retrieval of information. Emotion may also be part of 
the code, although its role is not well understood. 

Knowledge structures facilitate and constrain patterns of 
activation in the mind. What comes to mind, that is, into 
consciousness, is limited and is the result of the activation of 
the networks. Activation is rapid but goes along pathways 
determined by the ways information is encoded. Encoding 
puts information into particular categories and structures the 
pathways by which the information will be retrieved. Cues 
are stimuli that are related to the codes and stimulate the 
activation of the networks. Activation is rapid but does take 
time. The amount of time it takes for someone to respond to 
a stimulus (reaction time) is often used in research as a clue 
to the way information is coded. 

There are a number of models of the question-answering 
process (Cannell, Miller and Oksenberg 1981; Strack and 
Martin 1987; Tourangeau and Rasinski 1988; Sudman 
et al. 1996) that, while differing in details, generally agree 
on a series of processes respondents go through in 
answering questions. These processes are: 1) comprehending 
the meaning of the question; 2) retrieving relevant 
information; 3) formulating an answer; and 4) formatting and 
editing the answer to meet the requirements of the 
interviewer and the respondent’s self-presentation. While 
these processes are conceptually viewed as a linear sequence, it is recognized that in 
reality the processes occur in the flow of a conversation and 
that the different processes may go on in parallel or in rapid 
cycling back and forth. For purposes of considering the 
question-answer process, it is useful to consider them as if 
they were separate and proceeded in an orderly sequence. 


Comprehension 


In order to answer a question, respondents must first 
understand what they are being asked. The goal for the 
researcher is for respondents to understand the question in 
the same way that the researcher does. This goal is very 
difficult to reach because of the many subtleties and ambi- 
guities of language. Indeed Belson (1981), who has studied 
extensively respondents’ understanding of common terms 
such as “weekday”, “children,” “regularly” and 
“proportion,” found widespread misunderstanding even in 
questions using such common terms. 

Comprehension begins with a perceptual process of 
interpreting a string of sounds or written symbols as words 
in a language that respondents understand. The string of 
words is “parsed” into syntactical units that are understood, 
that is, the meaning that is encoded in the linguistic units is 
extracted by a process that is still poorly understood. Many 
comprehension problems occur because of ambiguities 
arising from words that have different meanings (lexical 
ambiguity) or are used in different ways (structural 
ambiguity). For example, the question “Where is the table?” 
is lexically ambiguous because the word “table” can refer to 
an object on which things can be placed or a set of numbers 
arranged on a sheet of paper. The sentence “Flying planes 
can be dangerous” is structurally ambiguous. The interpretation 
depends on whether “flying” is understood as a verb 
or as an adjective. Structural ambiguities can be resolved by 
careful wording of questions. Lexical ambiguities, on the 
other hand, are inherent in language and are usually 
resolved by the context within which the sentences appear. 

Context plays an important role not only in resolving 
ambiguities but also in interpreting the meaning of 
words that are unfamiliar. For example, a study by Schuman 
and Presser (1981) found that a question about the Monetary 
Control Bill, an obscure piece of proposed legislation, was 
interpreted as referring to an anti-inflationary measure when 
it occurred after a series of questions about inflation, but 
was interpreted as referring to controls of the international 
transfer of money when it occurred after questions dealing 
with the balance of payments. 

The underlying psychological mechanism for these types 
of context effects is priming. In order to interpret the stream 
of sounds or written symbols, we have to draw on our 
semantic memory that contains the store of linguistic infor- 
mation that enables us to understand the languages we 
know. Since this is a large store of knowledge, it takes time 
to retrieve information, and some things will be more easily 
accessible than others. Those bits of information that have 
been recently activated are more easily accessible and will 
be used first to interpret what is being said or read. Priming 
activates thoughts or “schemata”, that is, organized thoughts 
about objects or concepts, so that they are more accessible 
to consciousness and thus more easily come into play in 
interpreting the questions. In the example above, previous 
questions have primed either thoughts about inflation or 
about international flows of money, so that when the 
unfamiliar concept of the Monetary Control Bill is asked 
about, the thoughts that have been primed come more 
rapidly to the fore and affect the interpretation of the words. 

Different meanings may be differentially accessible to 
different respondents because of the frequency with which 
they employ them in daily life. For example, Billiet (cited in 
Bradburn 1992, page 317) observed that, in response to the 
question “How many children do you have?”, some 
respondents offered numbers between twenty and thirty. 
Further inspection of the data revealed that these 
respondents were teachers who interpreted the question to 
refer to the children in their classes, the meaning that was 
most accessible in their memories. 


Information Retrieval 


Once a question has been comprehended, respondents 
must retrieve from memory the information necessary to 
answer the question. In almost all cases this means 
retrieving the information from long-term memory. If the 
question is about behavior, the relevant information is likely 
to be stored in episodic memory. If the question is about 
attitudes, the relevant information is likely to be stored in 
semantic memory, but may require some retrieval from 
episodic memory. 

Remembering is a process by which the memory 
storehouse is searched to retrieve a particular item that is 
being sought. If we think of memory as a big storehouse, it 
is clear that it must be organized in some way in order for us 
to be able to retrieve things from it. Just as we must label 
files when we put them in file drawers, so we must attach 
some kind of labels to information in the memory 
storehouse. The labeling process, often called “encoding,” 
refers to various aspects of the information or the 
experience, including emotional tone, attached to the item 
when we stored it in memory so that we can retrieve it. (For 
a more complete discussion of memory models see 
Tourangeau et al. 2000, Chapter 3). 

Barsalou (1988) has proposed a theory that provides a 
good framework for understanding how information about 
personal events is stored in memory. He notes that infor- 
mation about activities or event types in episodic memory 
includes not only specific events but also extensive 
idiosyncratic, generic knowledge about the events, that is, 
having a generic mental image of some types of activity, 
e.g., visiting a pediatrician, rather than an image of a 
particular event, e.g., going to Dr. Jones about your 
daughter’s rash (Brewer 1986, 1994). For activities to be 
stored in memory, they must be comprehended. In other 
words they must be understood within some meaning 
system, usually linguistic, that brings to bear knowledge of 
past activities and generic knowledge about similar event 
types as well as specifics of the event itself and the context 
within which it occurred. This complex set of information 
that goes into the comprehension of the event becomes 
integrated into the memory of the event. The comprehension 
process determines how the memories are encoded. 

Information, such as the wording of the question and any 
explanatory material available to respondents at the time 
they are asked to recall an event, acts as retrieval cues. 
Retrieval cues are any words, images, emotions, etc. that 
activate or direct the memory search process. If retrieval 
cues do not specify the event type, e.g., pediatrician visits, 
then the event types must be inferred before the search can 
begin. This inference can come from the wording of the 
question or from the larger context in which the question is 
asked, including the preceding questions or the introductory 
material to the survey. 

Retrieval is an active process that is facilitated by cues in 
the question that activate the pathways of association 
leading to the desired information. Because information, 
both in episodic and semantic memory, is encoded in many 
different ways, the cues in the question or in the context 
surrounding the question including previous questions, may 
facilitate or constrain the activation and produce better or 
poorer retrieval. 

Retrieval takes time. One clear empirical finding is that 
giving respondents more time to answer questions produces 
more accurate reports, particularly for behavioral questions. 
But time is not all there is to it. Memories for events in 
one’s life appear to be organized in event sequences 
(Barsalou 1988), for example, a summer vacation or a 
hospitalization, which are hierarchically organized. Giving 
respondents cues to remind them about the sequence is more 
effective than trying to get them to retrieve information 
about a specific event. For example, in questions about 
alcohol consumption, giving examples of the kinds of situa- 
tions in which one might drink increases consumption 
reports. 

Examples are an important aid to recall, but they are not 
a panacea. Giving respondents a list of magazines that they 
might have read improves reports of reading; a list of 
organizational types helps respondents remember all the 
organizations they belong to. While examples may help 
reduce omissions, they have the effect also of being direct 
cues for memory and result in greater reports for the types 
of items on the list. If an important type of activity or event 
is omitted from a list, the lack of a cue for that type of 
activity may result in underreporting. The cuing effect of 
question wording can scarcely be overestimated. 

When thinking about retrieval, we mostly think about 
forgetting or failure to retrieve relevant information. 
Sometimes, however, incorrect information may be retrieved that 
results in overreporting behavior. The best-known example 
is the phenomenon observed by Neter and Waksberg (1964) 
called “telescoping”, that is, recalling events that took place 
at a time other than the time period asked about. 
Telescoping occurs in response to questions about behavior 
in a defined time period such as: “How many times have 
you been to the doctor in the past 6 months?” Neter and 
Waksberg found in analyzing data from the Consumer 
Expenditure Survey that when respondents reported on 
purchases in different reference periods, there was a 
systematic overreporting of purchases that came from 
reporting purchases made in a previous period as if they had 
been purchased in the period being asked about. While the 
phenomenon has been observed in a number of studies, 
there had been no cognitive explanation for it until recently. 

Memory for the time of events becomes more uncertain 
the further back in time the event happened, even though 
there is no systematic bias in the reports. Telescoping results 
from the conjunction of two processes: rounding and 
bounding. Rounding refers to the fact that respondents 
round their estimates for when things took place in 
successively larger periods the further back in time an event 
occurred. For example, events are remembered as having 
occurred in “days ago” discretely up to about 7 days ago, 
then they are rounded to periods such as 10 days, two 
weeks, 4 weeks, 3 months, and 6 months ago. Bounding 
refers to the aspect of the question that limits the time of 
reports, e.g., the last 6 months. The effect of this bounding is 
to truncate reports of events that are remembered as having 
occurred longer ago than 6 months. Since the variance in the 
memory for the dates of events becomes larger the further 
back the event occurred, a larger number of events will be 
incorrectly remembered as falling into the period the further 
back the events occurred. This overreporting of events from 
outside the period will not be offset by an underreporting of 
events in the near term because events cannot be reported 
that have not yet happened. Since there are no offsetting 
events remembered as occurring outside the period at the 
other end of the time boundary, i.e., the future, the result is a 
net overreport. (For a full explication of the model see 
Huttenlocher, Hedges and Bradburn 1990). 
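
The net overreport described above can be illustrated with a small simulation. The sketch below is not the Huttenlocher, Hedges and Bradburn (1990) model itself. It is a minimal illustration that assumes events spread uniformly over a two-year horizon, unbiased and normally distributed dating errors whose standard deviation is proportional to elapsed time, and a 6-month (180-day) reference period. Rounding is omitted. The function name and all parameter values are hypothetical.

```python
import random


def simulate_telescoping(n_events=100_000, horizon_days=720.0,
                         reference_days=180.0, rel_sd=0.3, seed=0):
    """Return (true_count, reported_count) for events in the reference period.

    Assumptions (illustrative only): event times are uniform over the
    horizon, and the remembered elapsed time is unbiased with a standard
    deviation equal to rel_sd times the true elapsed time.
    """
    rng = random.Random(seed)
    true_count = 0
    reported_count = 0
    for _ in range(n_events):
        elapsed = rng.uniform(0.0, horizon_days)  # true days since the event
        # Dating error: no systematic bias, but spread grows with elapsed time.
        remembered = rng.gauss(elapsed, rel_sd * elapsed)
        # An event cannot be remembered as lying in the future.
        remembered = max(remembered, 0.0)
        if elapsed <= reference_days:
            true_count += 1
        # Bounding: only events remembered as inside the period are reported.
        if remembered <= reference_days:
            reported_count += 1
    return true_count, reported_count


true_count, reported_count = simulate_telescoping()
print("true events in period:    ", true_count)
print("reported events in period:", reported_count)
print("ratio reported / true:    ", reported_count / true_count)
```

Under these assumptions the reported count exceeds the true count, because more events are remembered into the period from beyond the boundary than are remembered out of it. Setting rel_sd to zero removes the effect.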


Formulating an Answer 


Taking into account the information activated by the cues 
provided by the questions and the context in which they are 
asked and retrieved from memory, respondents must formu- 
late an answer to the question. Some information is easily 
accessible. For example, if the questions are about well- 
rehearsed topics, such as birthdates or marital status, or 
about topics for which the respondents have an already well- 
articulated position, respondents may retrieve the answers 
directly. They spring, as it were, fully formed from memory 
and can be reported directly. This kind of information we 
call chronically accessible. 

On the other hand, if the questions are about behavior 
that has not been thought about recently and is not well- 
remembered or about attitudes that have not been well 
thought out or discussed, respondents must construct 
answers on the spot using all the information from whatever 
source available to them in working memory. This 
construction process utilizes not only chronically available 
information but also, importantly, information that is 
temporarily accessible because it has been activated by the 
question itself, contextual cues, previous questions, or any 
other aspects of the interview situation. 

There are several general cognitive processes that are 
pervasive strategies used to process information efficiently. 
Assimilation and contrast are two such fundamental 
processes that affect communications. In the study of 
perception, assimilation refers to the tendency to perceive 
stimuli as more alike than they actually are. Contrast refers 
to the tendency to perceive stimuli as more different than 
they actually are. Applying these principles to survey 
answering leads to what has been called the inclusion/ 
exclusion model (Schwarz and Bless 1992; Sudman et al. 
1996). Information that is included in the temporary 
representation that respondents form of the target of the 
question will result in assimilation effects because the 
judgment required to answer the question is based on infor- 
mation included in the representation used. If the informa- 
tion is positive, the judgment will be more positive. If the 
information is negative, the judgment will be more negative. 
The size of the effect depends on the amount and extremity 
of the temporarily accessible information. 

Previous questions may activate thoughts that are then 
included in the representation of topics of later questions. 
The impact of a given question decreases as the number of 
other context questions increases. For example, answering a 
question about marital happiness had a pronounced effect on 
answers to subsequent questions about general life satis- 
faction when respondents’ marriages were the only specific 
life domain asked about. When respondents were asked 
about their leisure time and their jobs in addition to 
questions about their marriages before reporting on life 
satisfaction, the effect was significantly reduced (Schwarz, 
Strack and Mai 1991). 

Information that is excluded rather than included in the 
temporary representation of the target will lead to a contrast 
effect. In this case, if the information excluded is positive, 
the judgment will become more negative; if the information 
is negative, the judgment will become more positive. 
Similarly the size of the effect depends on the amount and 
extremity of the temporarily accessible information. In 
effect, the excluded information is subtracted from the 
representation of the attitude object. 

Excluded information, however, may play an additional 
role in formulating judgments. In addition to being excluded 
from the representation of the target, the information may be 
used in constructing a standard or scale anchor. In this case 
we speak of comparison-based contrast effects. The effect 
here is not caused so much by the subtraction of the 
excluded information from the evaluation of the attitude 
target, but by the comparison of the target with some 
standard or by its evaluation on some scale. 

Which of these processes drives the emergence of a 
contrast effect determines whether the contrast effect is 
limited to the single object or generalizes across related 
objects. If the contrast effect is based on simple subtraction, 
the effect is limited to that particular target. If the contrast 
effect is based on a comparison, the effects are apt to appear 
in each judgment where that standard of comparison is 
relevant. 

An example of a contrast effect based on using infor- 
mation from previous questions is provided in a study by 
Schwarz, Muenkel and Hippler (1990). Respondents were 
asked to rate a number of beverages according to how 
“typically German” they were. When this question was 
preceded by a question about the frequency with which 
Germans drink beer or vodka, contrast effects appeared in the 
typicality ratings. Respondents who had estimated the 
consumption of beer first (a high frequency item), rated 
wine, milk and coffee as less typical German drinks than did 
respondents who had estimated the consumption of vodka 
first (a low frequency item), thus showing a contrast effect 
that extended across the three target drinks. This contrast 
effect, however, did not appear when the preceding question 
was about the caloric content of beer or vodka because the 
information activated by this question was not relevant to a 
judgment about typicality. 


Formatting and Editing Responses 


After respondents have formulated their responses, there 
remains the task of fitting these answers into the response 
formats that the interviewer offers. Rarely in surveys does 
the researcher allow respondents to answer questions in a 
free format. Open-ended questions have a multitude of 
problems, not least of which is the cost and difficulty of 
transforming free-form answers into a format that can be 
treated quantitatively. Today almost all questionnaires 
depend on closed or pre-coded questions. 

Research on response alternatives is less well developed 
theoretically than the study of question wording and context 
effects. In general, the empirically observed effects are 
thought to stem from two sources: memory limitations and 
cognitive elaboration stimulated by the response 
alternatives. 

Memory limitations create some order effects among 
response alternatives. Primacy and recency are two well- 
known effects in the memory literature. When a series of 
stimuli are presented visually, those that come early in the 
series are remembered better than those later in the series 
(primacy). When a series of stimuli are presented in an auditory 
mode, those that come late in the series are remembered 
better (recency). Thus there is an interaction between the 
order in which stimuli are presented and the mode by which 
they are presented. 

The research literature has shown that there are 
persistent, although in general samples fairly small, primacy 
and recency effects in the serial position of response 
alternatives depending on the mode of presentation. Primacy 
effects appear when the response alternatives are presented 
visually, as in show cards in personal interviewing, and 
recency effects appear in telephone interviewing when the 
respondents have to depend entirely on auditory memory for 
the response alternatives. More recent research (Knaeuper 
1999; Schwarz and Knaeuper 2000), however, reveals that 
the effect is very much a function of memory capacity and is 
sharply increased among older respondents whose memory 
is poorer and who depend more on the primacy or recency 
of the stimuli as supported by mode of presentation. Among 
older respondents, the primacy/recency effects can be quite 
large, on the order of 20 percentage points (Schwarz and 
Knaeuper 2000). Among younger respondents the effects 
are small. 

An intriguing theory to account for some observed 
response order effects within a question is that of cognitive 
elaboration. This theory draws on early work by Krosnick 
and Alwin (1987) and cognitive research on persuasion 
(Eagly and Chaiken 1993; Petty and Cacioppo 1986). This 
theory hypothesizes that the order and mode in which 
response alternatives are presented affects respondents’ 
opportunity to elaborate on their content. Such elaboration, 
in turn, activates thoughts in response to the question and 
provides retrieval cues in response to behavioral questions. 
The response alternatives provide supplementary cues that 
activate a range of thoughts that become temporarily 
accessible and may become part of the answer formulation 
process. In effect, the response alternatives are an essential 
part of the question but may be processed later in time after 
the question itself has been processed. 

The cognitive elaboration hypothesis suggests a number 
of complex predictions, few of which have yet been tested. 
One example, for which there is considerable evidence, is an 
interaction between serial position and mode of adminis- 
tration in long lists. The primacy effect evident in visually 
presented material gives respondents time and stimulus to 
think more about alternatives early in the list before giving 
an answer. The crowding out of early alternatives by the 
reading of later alternatives and recency effects evident in 
lists presented in an auditory mode suggest that the later 
alternatives can be more deeply processed cognitively. 
These effects are more robust than the primacy and recency 
effects that appear to depend more on simple memory 
limitations. 

Once a response alternative has been chosen in the 
respondent’s mind, the respondent may still edit the 
response. As mentioned earlier, the interview is a social 
situation and respondents may be concerned with self- 
presentation. There is ample evidence that social desirability 
is an important aspect of the response process and responses 
to sensitive questions may be seriously distorted by 
unwillingness to admit to behavior or attitudes that would 
put the respondent in a bad light in the interviewer’s eyes or 
by the desire to overclaim socially desirable behavior 
(Bradburn, Sudman and Associates 1979; Sudman and 
Bradburn 1974). There are several techniques for reducing 
social desirability bias, although there is no technique that 
totally and reliably eliminates it. The general strategy is to 
increase social distance between respondents and 
interviewers. This can be done by changing the mode of 
administration to eliminate or reduce the presence of the 
interviewer. Computer Assisted Personal Interviews (CAPI) 
which allow respondents to directly enter responses to 
sensitive questions into the computer as part of a face-to- 
face interview enable researchers to combine the benefits of 
a personal interview with a self-administered questionnaire. 
The use of audio enhanced CAPI (Audio-CAPI) which 
enables respondents to listen to a recorded voice reading the 
questions, although somewhat more expensive, overcomes 
literacy and language problems that might arise when 
respondents have to read questions from a computer screen. 

Research on mode effects generally indicates that self- 
administration of a questionnaire, particularly in an anony- 
mous, group setting, minimizes, but does not entirely 
eliminate desirability bias. Interviews done on the telephone 
generally produce results that are intermediate between a 
face-to-face interview and a totally anonymous self- 
administration, although the results are not entirely 
consistent. 

In addition to increasing the social distance between 
interviewer and respondent by altering the mode of 
administration, there are techniques for increasing the real or 
perceived anonymity of respondents that also reduce social 
desirability bias. For example, respondents may put their 
responses in a sealed envelope and mail them back to a 
central office so that they know that the interviewer cannot 
see their responses. 

Another technique is the so-called random response tech- 
nique, although it is more properly a random question tech- 
nique (Greenberg, Abul-Ela, Simmons and Horvitz 1969; 
Horvitz, Shah and Simmons 1967; Warner 1965). The 
interviewer asks two questions, one sensitive and the other 
non-sensitive. Both questions have the same possible 
answers, “yes” and “no”. Which question the respondent 
answers is determined by a probability mechanism, such as 
flipping a coin or using a plastic box containing two colored 
beads, e.g., red and blue beads, in differing proportions, e.g., 
70% red beads and 30% blue beads. The box is designed so 
that when it is shaken by respondents a red or a blue bead 
seen only by the respondent will appear in the window of 
the box. If the bead is red, the sensitive question is 
answered; if blue, the non-sensitive question is answered. 
The interviewer does not know which question is answered. 
By using this procedure you can estimate the behavior of 
a group on the sensitive questions, but not that of any single 
individual. Thus with this method you cannot relate indivi- 
dual characteristics of respondents to individual behavior. If 
you have a very large sample, group characteristics can be 
related to the estimates obtained from randomized 
responses. For example, you could look at all the answers of 
young women and compare them to all the answers of men 
or young versus older age groups. On the whole, however, 
much information is lost when randomized response is used. 
While, compared with other methods, randomized 
response greatly reduces the underreporting of undesirable 
behavior, it does little to reduce the overreporting of desir- 
able behavior. It also does not entirely eliminate under- 
reporting of undesirable behavior (Bradburn et al. 1979). 
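
The group-level estimate mentioned above follows from the selection probabilities. As a minimal sketch, assume that the bead mechanism selects the sensitive question with a known probability p (0.7 in the example above) and that the proportion answering “yes” to the non-sensitive question is known in advance. The second assumption is added here for illustration and is not stated in the text. The observed share of “yes” answers is then p times the sensitive proportion plus (1 - p) times the non-sensitive proportion, which can be solved for the sensitive proportion. Function names and values below are hypothetical.

```python
import random


def estimate_sensitive_proportion(yes_share, p_sensitive, innocuous_yes_rate):
    """Solve yes_share = p * pi_s + (1 - p) * pi_n for pi_s.

    yes_share          : observed share of "yes" answers in the group
    p_sensitive        : probability the device selects the sensitive question
    innocuous_yes_rate : assumed known "yes" rate for the non-sensitive question
    """
    return (yes_share - (1.0 - p_sensitive) * innocuous_yes_rate) / p_sensitive


def simulate_yes_share(n, true_sensitive_rate, p_sensitive,
                       innocuous_yes_rate, seed=0):
    """Simulate n respondents who each answer the question their bead selects."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        if rng.random() < p_sensitive:   # red bead: sensitive question
            rate = true_sensitive_rate
        else:                            # blue bead: non-sensitive question
            rate = innocuous_yes_rate
        if rng.random() < rate:
            yes += 1
    return yes / n


# The interviewer sees only the pooled share of "yes" answers.
observed = simulate_yes_share(n=100_000, true_sensitive_rate=0.2,
                              p_sensitive=0.7, innocuous_yes_rate=0.5)
print(estimate_sensitive_proportion(observed, 0.7, 0.5))
```

Because the observed share is divided by p, the sampling error of the estimate is larger than it would be if every respondent answered the sensitive question directly, which is one way to see the loss of information noted above.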


CONCLUSION 


In this essay, I have tried to present the outlines of a 
social psychological approach to the understanding of the 
question-answer process in the survey interview. This 
approach draws on theory from sociology, cognitive 
psychology and linguistics, to present a comprehensive 
framework for research on response effects. Much, how- 
ever, remains uncertain or unknown. 

While social role theory provides a good starting point 
for conceptualizing the social relations among researchers, 
interviewers and respondents, there is much we do not know 
about how these roles are played by their respective actors 
and how they may be changing. Contemporary concerns 
about privacy and confidentiality of data and protection of 
human participants in research are changing to an unknown 
degree the way respondents view surveys and social 
research. Technology is changing respondents’ ability to 
protect their privacy and researchers’ ability to protect 
confidentiality of data. Response rates have been declining 
and greater efforts are required to convince sampled persons 
to respond. Interviewing is increasingly mediated by 
computer-assistance, which may change the way in which 
respondents and interviewers interact and the way 
respondents view the interview situation. 

The cognitive processes involved in formulating an 
answer are complex and not yet fully understood. The appli- 
cation of our understanding of fundamental cognitive 
processes to the study of question formulation and order 
goes a long way toward improving our understanding of 
context effects. Cognitive science is making great strides in 
understanding how the brain works and how we organize 
and process information. New knowledge in these areas 
grows at a rapid pace. As we learn more, many of the 
conceptualizations outlined in this essay will change and 
either be shown to be wrong or be greatly elaborated. 

Finally there is a great challenge to linguistics. Many of 
the effects we have discussed in this essay occur because of 
ambiguities in language. Understanding how meaning is 
encoded in language and how we extract that meaning from 
spoken and written language is a formidable challenge. 
Perhaps more than anything else, our ability to resolve some 
of the most fundamental problems in questionnaire 
construction depends on progress in these areas. 

What are the high priority areas for research? In the short 
run, I would concentrate on better understanding of the 
biasing effects of declining respondent participation, parti- 
cularly on possible distortions of responses from reluctant 
respondents. We must develop response effect models that 
not only account for missing data, whether at the item level 
or at the whole person level, but also for response effects 
introduced by reluctant respondents who give only partial 
answers or not well-considered answers. Multiple imputa- 
tion models such as those developed by Little and Rubin 
(1987) and latent variable approaches such as developed by 
O’Muircheartaigh and Moustaki (1999) are promising. 
More empirical work is needed on the effects of pushing 
people into responding who initially are unwilling to 
participate in a survey. 

In the longer run, further research is needed on the 
mechanisms by which questions and answer categories 
stimulate cognitive elaboration and activate thoughts that 
are then used in answering questions. We need to know 
what it is about questions that cause respondents to exclude 
information in making a judgment as contrasted with those 
that stimulate them to include information when they make 
judgments. Progress in this area will require a close collabo- 
ration between cognitive psychologists and survey metho- 
dologists and involve both laboratory and field survey work. 

In the end, however, fundamental understanding of the 
question-answer process will only come when we 
understand how meaning is communicated between human 
beings. Questions have meaning that we expect respondents 
to comprehend. We can only go so far in improving the 
process of clear communication without a much deeper 
understanding of the basic mechanisms of communication. 
We need a concerted multidisciplinary effort by linguists, 
psychologists, statisticians, and cognitive scientists and 
others to crack the meaning code much as natural scientists 
cracked the genetic code. It is one of the grand scientific 
challenges of our time. 

Linearization Variance Estimators for Survey Data 


ABDELLATIF DEMNATI and J.N.K. RAO 


ABSTRACT 


In survey sampling, Taylor linearization is often used to obtain variance estimators for calibration estimators of totals and 
nonlinear finite population (or census) parameters, such as ratios, regression and correlation coefficients, which can be 
expressed as smooth functions of totals. Taylor linearization is generally applicable to any sampling design, but it can lead 
to multiple variance estimators that are asymptotically design unbiased under repeated sampling. The choice among the 
variance estimators requires other considerations such as (i) approximate unbiasedness for the model variance of the 
estimator under an assumed model, (ii) validity under a conditional repeated sampling framework. In this paper, a new 
approach to deriving Taylor linearization variance estimators is proposed. It leads directly to a variance estimator which 
satisfies the above considerations at least in a number of important cases. The method is applied to a variety of problems, 
covering estimators of a total as well as other estimators defined either explicitly or implicitly as solutions of estimating 
equations. In particular, estimators of logistic regression parameters with calibration weights are studied. It leads to a new 
variance estimator for a general class of calibration estimators that includes generalized raking ratio and generalized 
regression estimators. The proposed method is extended to two-phase sampling to obtain a variance estimator that makes 
fuller use of the first phase sample data compared to traditional linearization variance estimators. 


KEY WORDS: Calibration; Design weights; Estimating equations; Raking ratio estimator; Regression estimators; Two-phase sampling. 


1. INTRODUCTION 


Taylor linearization is a popular method of variance 
estimation for complex statistics such as ratio and 
regression estimators and logistic regression coefficient 
estimators. It is generally applicable to any sampling design 
that permits unbiased variance estimation for linear estima- 
tors, and it is computationally simpler than a resampling 
method such as the jackknife. However, it can lead to 
multiple variance estimators that are asymptotically design 
unbiased under repeated sampling. The choice among the 
variance estimators, therefore, requires other considerations 
such as (i) approximate unbiasedness for the model vari- 
ance of the estimator under an assumed model, (ii) validity 
under a conditional repeated sampling framework. For 
example, in the context of simple random sampling and the ratio estimator, $\hat{Y}_R = (\bar{y}/\bar{x})X$, of the population total $Y$, Royall and Cumberland (1981) showed that a commonly used linearization variance estimator, $v_L = N^2(n^{-1} - N^{-1})s_z^2$, does not track the conditional variance of $\hat{Y}_R$ given $\bar{x}$, unlike the jackknife variance estimator $v_J$. Here $\bar{y}$ and $\bar{x}$ are the sample means, $X$ is the known population total of an auxiliary variable $x$, $s_z^2$ is the sample variance of the residuals $z_k = y_k - (\bar{y}/\bar{x})x_k$ and $(n, N)$ denote the sample and population sizes. By linearizing the jackknife variance estimator, $v_J$, a different linearization variance estimator, $v_{JL} = (\bar{X}/\bar{x})^2 v_L$, is obtained, where $\bar{X} = X/N$ is the mean of $x$. This variance estimator tracks the conditional variance as well as the unconditional variance. As a result, $v_{JL}$ or $v_J$ may be preferred over $v_L$. Yung and Rao (1996) considered generalized regression and ratio-adjusted post-stratified estimators under stratified multistage sampling and obtained a jackknife linearization variance estimator, $v_{JL}$, by linearizing $v_J$. Valliant (1993) also obtained $v_{JL}$ for the ratio-adjusted post-stratified estimator and conducted a simulation study to demonstrate that both $v_J$ and $v_{JL}$ possess good conditional properties given the estimated post-strata counts. Särndal, Swensson and Wretman (1989) showed that $v_{JL}$ is both asymptotically design unbiased and approximately model unbiased in the sense of $E_m(v_{JL}) \approx V_m(\hat{Y}_R)$, where $E_m$ denotes model expectation and $V_m(\hat{Y}_R)$ is the model variance of $\hat{Y}_R$ under a "ratio model": $E_m(y_k) = \beta x_k$; $k = 1, \ldots, N$ and the $y_k$'s are independent with model variance $V_m(y_k) = \sigma^2 x_k$, $\sigma^2 > 0$. Thus, $v_{JL}$ is a good choice from either the design-based or the model-based perspective. 
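
The relationship between the two linearization variance estimators for the ratio estimator under simple random sampling can be sketched numerically. The following is a minimal illustration only: the function name, the five sample values, the population size and the known total of x are all made up for the example and do not come from the paper.

```python
# Sketch: v_L and v_JL for the ratio estimator under simple random sampling.
# All data and the known total X_total are hypothetical illustration values.

def ratio_variance_estimators(y, x, N, X_total):
    """Return (Y_R, v_L, v_JL) for an SRS sample (y, x) of size n from N units."""
    n = len(y)
    ybar = sum(y) / n
    xbar = sum(x) / n
    r_hat = ybar / xbar                      # estimated ratio ybar / xbar
    y_r = r_hat * X_total                    # ratio estimator of the total Y
    # Residuals z_k = y_k - (ybar / xbar) x_k; they sum to zero by construction.
    z = [yk - r_hat * xk for yk, xk in zip(y, x)]
    s2_z = sum(zk ** 2 for zk in z) / (n - 1)
    v_l = N ** 2 * (1.0 / n - 1.0 / N) * s2_z
    xbar_pop = X_total / N                   # population mean of x
    v_jl = (xbar_pop / xbar) ** 2 * v_l      # jackknife linearization version
    return y_r, v_l, v_jl


y = [12.0, 15.0, 9.0, 20.0, 14.0]
x = [10.0, 13.0, 8.0, 17.0, 12.0]
y_r, v_l, v_jl = ratio_variance_estimators(y, x, N=50, X_total=650.0)
print(y_r, v_l, v_jl)
```

In this toy sample the sample mean of x is below the population mean, so the factor applied to the customary estimator is larger than one and the second estimator is the larger of the two.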

Binder (1996) presented an elegant "cookbook" approach to Taylor linearization that leads directly to $v_{JL}$-type linearization variance estimators. He applied the method to smooth functions of estimated totals, $g(\hat{Y}_1, \ldots, \hat{Y}_m)$, generalized regression estimators and the Wilcoxon rank sum statistic. To illustrate Binder's method, consider a ratio estimator 

$$\hat{Y}_R = \frac{\hat{Y}}{\hat{X}} X = \hat{R} X,$$

where $\hat{Y} = \sum d_k(s) y_k = \hat{Y}(y)$, $\hat{X} = \sum d_k(s) x_k = \hat{Y}(x)$ and the $d_k(s)$ are the design weights with $d_k(s) = 0$ if the population element $k$ is not in the sample $s$, e.g., $d_k(s) = (1/\pi_k) a_k(s)$ where $\pi_k$ is the probability of including the element $k$ in the sample $s$, $a_k(s) = 1$ if $k \in s$, $a_k(s) = 0$ otherwise, and $\sum$ denotes summation over the population elements. The weights are assumed to provide a design unbiased estimator $\hat{Y}$ of $Y$, i.e., $E(d_k(s)) = 1$ for $k = 1, \ldots, N$. Now take the total differential of $\hat{Y}_R$ to get 

$$(d\hat{Y}_R) = (d\hat{R}) X = \frac{X}{\hat{X}} (d\hat{Y} - \hat{R}\, d\hat{X}), \tag{1.1}$$

and replace all the total differentials in (1.1) by deviations of estimators from their respective population parameters, e.g., $d\hat{Y}_R$ is changed to $\hat{Y}_R - Y$. Then (1.1) yields 

$$\hat{Y}_R - Y \approx \sum d_k(s) z_k - \frac{X}{\hat{X}} (Y - \hat{R} X), \tag{1.2}$$

where 

$$z_k = \frac{X}{\hat{X}} (y_k - \hat{R} x_k). \tag{1.3}$$

The term $\sum d_k(s) z_k$ in (1.2) reduces to zero, but it is retained for variance estimation. On the other hand, the last term of (1.2) is ignored for variance estimation. Thus, $\hat{Y}_R - Y$ is represented as $\sum d_k(s) z_k = \hat{Y}(z)$ for the purpose of variance estimation. Denoting an unbiased variance estimator of $\hat{Y} = \hat{Y}(y)$ as $v(y)$, Binder's variance estimator of $\hat{Y}_R$ is given by $v(z)$. The linearization variance estimator $v(z)$, obtained from (1.3), agrees with $v_{JL}$ for simple random sampling and for stratified multistage sampling if the sample is treated as if the primary sampling units are sampled with replacement. Note that the jackknife method is not generally applicable to every sampling design. 

For the estimator $\hat{\theta} = g(\hat{Y}_1, \ldots, \hat{Y}_m)$ of a smooth function of totals, $\theta = g(Y_1, \ldots, Y_m)$, Binder's (1996) method leads to 

$$\hat{\theta} - \theta \approx \sum d_k(s) z_k + \ldots \tag{1.4}$$

with 

$$z_k = \sum_{j=1}^{m} \left( \partial g(a)/\partial a_j \right)\big|_{a = \hat{Y}}\; y_{jk},$$

where $\hat{Y} = (\hat{Y}_1, \ldots, \hat{Y}_m)^T$ and $a = (a_1, \ldots, a_m)^T$. It follows from (1.4) that the partial derivatives $\partial g(a)/\partial a_j$ are evaluated at $\hat{Y}$ to obtain $z_k$, whereas in the standard method (see e.g., Andersson and Nordberg 1994) they are evaluated at $Y = (Y_1, \ldots, Y_m)^T$ before getting $z_k$ and then substituting estimates for the unknown components. For example, for the ratio estimator $\hat{Y}_R$ the term $X/\hat{X}$ disappears from $z_k$ in the standard method because $X/\hat{X}$ becomes 1 when $X$ is replaced by $\hat{X}$. 

Although Binder’s (1996) approach is simple and attrac- 
tive, a more rigorous and broadly applicable method is 
needed. In section 2, we propose an alternative approach 
that is theoretically justifiable and at the same time leads 
directly to a $v_{JL}$-type variance estimator for general designs. 
We apply the method, in section 3, to a variety of problems, 
covering regression calibration estimators of a total Y and 
other estimators defined either explicitly or implicitly as 
solutions of estimating equations, e.g., estimators of logistic 
regression parameters with design weights calibrated to 
known auxiliary population totals. We also obtain a new 
variance estimator for a general class of calibration estima- 
tors that includes generalized raking ratio and generalized 
regression estimators. Section 4 extends the proposed 
method to two-phase sampling to obtain a variance esti- 
mator that makes fuller use of the first phase sample data 
compared to traditional linearization variance estimators. 

For the case of independent and identically distributed (iid) random variables $y_1, \ldots, y_n$ with distribution function $F(y)$, estimation of general parameters $\theta = T(F)$ has been studied extensively in the literature (see e.g., Huber 1981). A natural estimator of $\theta = T(F)$ is $\hat{\theta} = T(\hat{F})$, where $\hat{F}(y)$ is the empirical distribution function given by $\hat{F}(y) = n^{-1} \sum_{k=1}^{n} I(y_k \le y)$ with $I(y_k \le y) = 1$ if $y_k \le y$ and $I(y_k \le y) = 0$ if $y_k > y$. For example, if $T(F)$ is the population mean $\int y\, dF(y)$, then $T(\hat{F}) = \int y\, d\hat{F}(y) = n^{-1} \sum_{k=1}^{n} y_k = \bar{y}$, the sample mean. Note that $\hat{F}$ assigns equal mass, $1/n$, to each of the sample values $y_1, \ldots, y_n$. If $T$ is "sufficiently regular", then $T(\hat{F})$ may be linearized near $F$ in terms of the influence curve (or function) of $T(\cdot)$ given by 

$$IC(y, F, T) = \lim_{\alpha \to 0} \left[ T\big((1-\alpha)F + \alpha \delta_y\big) - T(F) \right] / \alpha, \tag{1.5}$$

where $\delta_y$ denotes the point mass 1 at $y$. We have 

$$\sqrt{n}\,\big(T(\hat{F}) - T(F)\big) = \sqrt{n} \int IC(y, F, T)\, d\hat{F}(y) + \sqrt{n} R_n = \frac{1}{\sqrt{n}} \sum_{k=1}^{n} Z_k + \sqrt{n} R_n, \tag{1.6}$$

where $Z_k = IC(y_k, F, T)$ and $\sqrt{n} R_n$ is a remainder term. If $\sqrt{n} R_n$ is asymptotically negligible in the sense that $\sqrt{n} R_n$ converges in probability to zero as $n \to \infty$ (denoted $\sqrt{n} R_n \to_p 0$) then it follows from (1.6) that $\sqrt{n}\,[T(\hat{F}) - T(F)]$ is asymptotically normal with mean 0 and variance 

$$A(F, T) = \int [IC(y, F, T)]^2\, dF(y), \tag{1.7}$$

noting that the terms $Z_k$ in (1.6) are iid random variables. As noted by Huber (1981, page 13), $\sqrt{n} R_n$ is "often" asymptotically negligible, but the proof of this property may not be easy for general functionals $T(F)$. Serfling (1980, section 6.2) gave the following two conditions for $\sqrt{n} R_n \to_p 0$, applicable for general random variables $y_1, \ldots, y_n$ (not necessarily iid): (i) $T(\cdot)$ is "stochastically differentiable" at $F$; (ii) $\sqrt{n} \sup |\hat{F}(y) - F(y)|$ is bounded in probability, where sup is over $y$. Condition (ii) is satisfied in the iid case, but it may not be easy to prove (ii) for complex sampling designs. Condition (i) means that there exists a functional $T(F; \hat{F} - F)$ such that $T(\hat{F}) - T(F) = n^{-1} \sum_{k=1}^{n} T(F; \delta_{y_k} - F) + R_n$, where $R_n$ is of lower order in probability than $\sup |\hat{F}(y) - F(y)|$ as the latter tends to zero. This condition may not be easy to verify for general $T(\cdot)$. Serfling (1980) suggested that in practice it is more effective to analyse $R_n$ directly using "the method of differential inequalities". 

A natural estimator of the asymptotic variance $A(F, T)$ is 

$$A(\hat{F}, T) = \frac{1}{n} \sum_{k=1}^{n} [IC(y_k, \hat{F}, T)]^2, \tag{1.8}$$

where $IC(y, \hat{F}, T)$ is the influence curve evaluated at $F = \hat{F}$. It follows that a linearization variance estimator of $T(\hat{F})$ is 

$$v_L[T(\hat{F})] = A(\hat{F}, T)/n. \tag{1.9}$$

Practical implementation of $v_L[T(\hat{F})]$ involves the computation of $IC(y_k, \hat{F}, T)$ for each specified $T$. The latter can be avoided by using the jackknife method. Substituting $\hat{F}$ for $F$ and $-1/(n-1)$ for $\alpha$ in (1.5), we obtain a jackknife estimator of $IC(y_k, \hat{F}, T)$ as $z_{kJ} = (n-1)[T(\hat{F}) - T(\hat{F}_{-k})]$, where $\hat{F}_{-k}(y)$ is the empirical distribution function obtained when $y_k$ is omitted. The resulting jackknife variance estimator of $T(\hat{F})$ is 

$$v_J[T(\hat{F})] = \frac{1}{n(n-1)} \sum_{k=1}^{n} z_{kJ}^2 = \frac{n-1}{n} \sum_{k=1}^{n} \left[ T(\hat{F}) - T(\hat{F}_{-k}) \right]^2; \tag{1.10}$$

see e.g., Hampel, Ronchetti, Rousseeuw and Stahel (1986, page 95). If $IC(y, F, T)$ does not depend smoothly on $F$, then the jackknife variance estimator may not be consistent for the variance of $T(\hat{F})$; for example, when $T(\hat{F})$ is the sample median. 
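
The delete-one jackknife estimate of the influence curve can be sketched for an arbitrary statistic. This is a minimal illustration, assuming the pseudo-residual (n - 1) times the difference between the full-sample statistic and the delete-one statistic, and assuming the variance is taken as the uncentered sum of squared pseudo-residuals divided by n(n - 1); the function names and the data are hypothetical.

```python
# Sketch: delete-one jackknife pseudo-residuals z_kJ and the resulting
# variance estimate for an arbitrary statistic T (iid case).
# The uncentered sum of squares is an assumption of this sketch.

def jackknife_variance(values, statistic):
    """Return (v_J, z) where z[k] = (n - 1) * (T(all) - T(all but k))."""
    n = len(values)
    t_full = statistic(values)
    z = []
    for k in range(n):
        t_minus_k = statistic(values[:k] + values[k + 1:])
        z.append((n - 1) * (t_full - t_minus_k))
    v_j = sum(zk ** 2 for zk in z) / (n * (n - 1))
    return v_j, z


def sample_mean(values):
    return sum(values) / len(values)


data = [3.0, 7.0, 8.0, 12.0, 15.0]
v_j, z = jackknife_variance(data, sample_mean)
print(v_j, z)
```

For the sample mean, each pseudo-residual equals the deviation of the observation from the mean, so the estimate coincides with the usual sample variance divided by n. As the text notes, the same recipe is not reliable for non-smooth statistics such as the median.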

Campbell (1980) attempted to extend the above results for the iid case to general sampling designs, using the design weights $d_k(s)$. The population (or census) parameter $\theta$ is now given by $\theta = T(F_N)$, where $F_N(y)$ is the population distribution function that assigns equal mass, $1/N$, to each of the $N$ population values $y_1, \ldots, y_N$. An empirical distribution function is given by $\hat{F}(y) = \sum_{k \in s} \tilde{d}_k(s) I(y_k \le y)$ where $\tilde{d}_k(s) = d_k(s) / \sum_{l \in s} d_l(s)$ are the normalized design weights. Note that $\hat{F}(y)$ assigns the mass $\tilde{d}_k(s)$ to the element $k \in s$. An estimator of $\theta = T(F_N)$ is given by $\hat{\theta} = T(\hat{F})$. For example, if $T(F_N)$ is the population mean $\int y\, dF_N(y)$, then $T(\hat{F}) = \int y\, d\hat{F}(y) = \sum_{k \in s} d_k(s) y_k / \sum_{k \in s} d_k(s)$, the design-weighted sample mean. Campbell (1980) followed the linearization (1.6) for the iid case and concluded that $\sqrt{n}\,[T(\hat{F}) - T(F_N)]$ is asymptotically normal with mean 0 and variance 


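$$A(F_N, T) = n\,\mathrm{Var}\left( \frac{\sum_{k \in s} d_k(s) Z_k}{\sum_{k \in s} d_k(s)} \right) \approx n\,\mathrm{Var}\left[ \sum_{k \in s} d_k(s) \{ (Z_k - R)/N \} \right], \tag{1.11}$$

using the approximate variance of a ratio, where $R = \sum_{k=1}^{N} Z_k / N$ is the population mean of the $Z_k$'s and $Z_k = IC(y_k, F_N, T)$. Denoting the unbiased variance estimator of $\hat{Y} = \hat{Y}(y) = \sum_{k \in s} d_k(s) y_k$ as $v(y)$, it follows from (1.11) that a linearization variance estimator of $T(\hat{F})$ is given by 

$$v_L[T(\hat{F})] = v[(z - \hat{R})/N], \tag{1.12}$$

where 

$$z_k = IC(y_k, \hat{F}, T), \tag{1.13}$$

and 

$$\hat{R} = \sum_{k \in s} \tilde{d}_k(s) z_k. \tag{1.14}$$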


To avoid the computation of the $z_k$'s, Campbell (1980) proposed a jackknife estimator of $z_k$ for each $k \in s$. It is given by 

$$z_{kJ} = \frac{1 - \tilde{d}_k(s)}{\tilde{d}_k(s)} \left[ T(\hat{F}) - T(\hat{F}_{-k}) \right], \tag{1.15}$$

where 

$$d\hat{F}_{-k}(y) = \begin{cases} \dfrac{d\hat{F}(y) - \tilde{d}_k(s)}{1 - \tilde{d}_k(s)} & \text{if } y = y_k, \\[2ex] \dfrac{d\hat{F}(y)}{1 - \tilde{d}_k(s)} & \text{if } y \ne y_k. \end{cases} \tag{1.16}$$

The resulting linearization variance estimator is given by $v[(z_J - \hat{R}_J)/N]$. Note that the proposed jackknife method is different from the customary jackknife for survey sampling. For example, for stratified multistage sampling, the customary jackknife deletes sample clusters in turn whereas the Campbell method deletes elements in turn. Also, the customary jackknife is not always applicable (e.g., unequal probability sampling without replacement), unlike the Campbell method, which uses the unbiased variance estimator $v(y)$ of the total $\hat{Y}$ for the given design and then replaces $y_k$ by $(z_{kJ} - \hat{R}_J)/N$. However, the computations involved in the Campbell method can be very heavy because it requires the computation of $T(\hat{F}_{-k})$ for each element $k \in s$; in large-scale surveys the number of sample elements can be very large, as in the Canadian Labour Force Survey. 

Deville (1999) and Berger (2002) obtained results very similar to those of Campbell (1980). Instead of using the natural probability measure $\hat{F}$, they considered functionals of the form $T(\hat{M})$, where $\hat{M}$ denotes a measure that allocates the design weight $d_k(s)$ to any point $y_k$ for $k$ in $s$ and zero to units $k$ not in $s$. For example, $T(\hat{M}) = \int x\, d\hat{M}(x) = \sum d_k(s) y_k$ if the population parameter is the total $T(M) = \int x\, dM(x) = Y$, where the measure $M$ allocates a unit mass to each of the $N$ points $y_k$ in the finite population $U$. Suppose that $T(\cdot)$ is of degree $\alpha$ in the sense that $N^{-\alpha} T(\cdot)$ tends to a limit for some $\alpha \ge 0$. Typically, $\alpha = 0$ or 1; for example, $\alpha = 1$ if $T(M)$ is the total $Y$ and $\alpha = 0$ if $T(M)$ is the ratio $R = Y/X$. Deville (1999) used the following asymptotic approximation: 

$$\sqrt{n}\, N^{-\alpha} \left[ T(\hat{M}) - T(M) \right] \approx \frac{\sqrt{n}}{N} \sum_{k=1}^{N} (d_k(s) - 1) Z_k, \tag{1.17}$$

where $d_k(s) = 0$ if $k$ is not in the sample $s$. Further, $Z_k = IT(M; y_k)$ with $IT$ denoting the influence function of $T(M)$ defined by 

$$IT(M; y) = \lim_{t \to 0} \frac{1}{t} \left[ T(M + t\delta_y) - T(M) \right]. \tag{1.18}$$

As noted earlier, it is not easy to justify the approximation (1.17) for general functionals $T(\cdot)$. Deville (1999) developed rules for evaluating $IT(M; y)$ for selected functionals $T(M)$. Berger (2002) used the jackknife method to estimate $z_k = IT(\hat{M}; y_k)$, similar to Campbell (1980). Noting that $\sum d_k(s) Z_k = \hat{Y}(Z)$, it follows from (1.17) that a linearization variance estimator of $N^{-\alpha} T(\hat{M})$ is given by $N^{-2} v(Z)$. But $Z_k$ depends on unknown parameters and the resulting estimator, $z_k$, may not be unique. For example, suppose $T(\hat{M}) = \hat{Y}_R = (\hat{Y}/\hat{X})X$; then $\alpha = 1$ and $Z_k = y_k - R x_k$, where $R = Y/X$. In this case, two possible candidates are $z_{1k} = y_k - \hat{R} x_k$ and $z_{2k} = (X/\hat{X})(y_k - \hat{R} x_k)$. Thus, the choice of $z_k$ in the presence of auxiliary information, such as a known total $X$, is not unique under Deville's approach. Unlike Deville's approach, our method leads to a unique choice $z_k$ and it avoids the calculation of $Z_k$ to determine $z_k$. Our $z_k$ satisfies desirable properties mentioned in section 1, at least in a number of important cases. 


2. THE METHOD 


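To motivate the method, we start with a simple general case where the estimator $\hat{\theta}$ of a parameter $\theta$ can be expressed as a differentiable function $g(\hat{Y})$ of estimated totals $\hat{Y} = (\hat{Y}_1, \ldots, \hat{Y}_m)^T$, where $\hat{Y}_j = \sum d_k(s) y_{jk}$ is the estimator of the total $Y_j = \sum_{k=1}^{N} y_{jk}$, $j = 1, \ldots, m$, and $\theta = g(Y)$ with $Y = (Y_1, \ldots, Y_m)^T$. We may write $\hat{\theta}$ as $\hat{\theta} = f(d(s), A_y)$ and $\theta = f(1, A_y)$, where $A_y$ is an $m \times N$ matrix with $k$th column $y_k = (y_{1k}, \ldots, y_{mk})^T$, $k = 1, \ldots, N$, $d(s) = (d_1(s), \ldots, d_N(s))^T$ and $1$ is the $N$-vector of 1's. For example, if $\hat{\theta}$ denotes the ratio estimator $\hat{Y}_R = [(\sum d_k(s) y_k)/(\sum d_k(s) x_k)]X$, then $m = 2$, $y_{1k} = y_k$, $y_{2k} = x_k$ and $f(1, A_y)$ reduces to the total $Y$, noting that $(Y/X)X = Y$. Note that $\hat{Y}_R$ is a function of $d(s)$, $y$ and $x$ and the known total $X$, but we drop $X$ for simplicity and write $\hat{Y}_R = f(d(s), y, x)$. 

Taylor linearization of $\hat{\theta}$ around $Y$ gives the approximation 

$$\sqrt{n}\, N^{-\alpha} (\hat{\theta} - \theta) \approx \frac{\sqrt{n}}{N} \left( \partial g(a)/\partial a \right)^T \big|_{a = Y}\, (\hat{Y} - Y), \tag{2.1}$$

where $\partial g(a)/\partial a = (\partial g(a)/\partial a_1, \ldots, \partial g(a)/\partial a_m)^T$ and $N^{-\alpha} g(\cdot)$ tends to a limit for some $\alpha \ge 0$. Asymptotic normality of $\sqrt{n}\, N^{-\alpha}(\hat{\theta} - \theta)$ follows from (2.1), provided a central limit theorem for $\sqrt{n}\, N^{-1}(\hat{Y} - Y)$ holds and $g(\cdot)$ has continuous first derivatives in a neighbourhood of the mean $\bar{Y}$. Krewski and Rao (1981) justified (2.1) for stratified sampling. 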

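Let $\tilde{Y} = \sum b_k y_k$ for arbitrary real numbers $b = (b_1, \ldots, b_N)^T$ and let $g(\tilde{Y}) = f(b, A_y) = f(b)$. Noting that $\hat{Y} = A_y d(s)$ and $Y = A_y 1$, we can express (2.1) as 

$$\sqrt{n}\, N^{-\alpha} (\hat{\theta} - \theta) \approx \frac{\sqrt{n}}{N} \left( \partial g(a)/\partial a \right)^T \big|_{a = Y}\, A_y (d(s) - 1) = \frac{\sqrt{n}}{N} \sum_{k=1}^{N} \left( \partial g(\tilde{Y})/\partial \tilde{Y} \right)^T \big|_{b = 1}\; y_k\, (d_k(s) - 1), \tag{2.2}$$

noting that $\tilde{Y} = Y$ is equivalent to $b = 1$. Now we substitute $y_k = \partial \tilde{Y}/\partial b_k$ in (2.2) to get 

$$\sqrt{n}\, N^{-\alpha} (\hat{\theta} - \theta) \approx \frac{\sqrt{n}}{N} \sum_{k=1}^{N} \left( \partial f(b)/\partial b_k \right) \big|_{b = 1}\, (d_k(s) - 1) = \frac{\sqrt{n}}{N}\, Z^T (d(s) - 1), \tag{2.3}$$

where $Z = (Z_1, \ldots, Z_N)^T$ with $Z_k = \partial f(b)/\partial b_k |_{b = 1}$. 

A variance estimator of the right hand side of (2.3) is given by $(n/N^2)\, v(Z)$, where $v(Z)$ is the variance estimator of the estimated total $\sum d_k(s) Z_k = \hat{Y}(Z)$. Since the $Z_k$'s are unknown, we replace $Z_k$ by $z_k = \partial f(b)/\partial b_k |_{b = d(s)}$ to get $(n/N^2)\, v(z)$. Thus, a linearization variance estimator of $\hat{\theta}$ is given by 

$$v_L(\hat{\theta}) = (N^{2\alpha}/N^2)\, v(z), \tag{2.4}$$

which reduces to $v(z)$ if $\alpha = 1$. Note that $v_L(\hat{\theta})$ given by (2.4) is simply obtained from the formula $v(y)$ for $\hat{Y}$ by replacing $y_k$ by $z_k$ for $k \in s$. Note that we do not first evaluate the partial derivatives $\partial f(b)/\partial b_k$ at $b = 1$ to get $Z_k$ and then substitute estimates for the unknown components of $Z_k$. Our method, therefore, is similar in spirit to Binder's approach. The variance estimator $v_L(\hat{\theta})$ is valid because $z_k$ is a consistent estimator of $Z_k$. 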


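Example 2.1 Suppose $\hat{\theta}$ is the ratio estimator $\hat{Y}_R = X[(\sum d_k(s) y_k)/(\sum d_k(s) x_k)]$ of the total $Y$. Then $f(b) = X[(\sum b_k y_k)/(\sum b_k x_k)]$ and 

$$\partial f(b)/\partial b_k = X\, \frac{y_k \sum_l b_l x_l - x_k \sum_l b_l y_l}{\left( \sum_l b_l x_l \right)^2}.$$

Therefore, 

$$z_k = \partial f(b)/\partial b_k \big|_{b = d(s)} = \frac{X}{\hat{X}} (y_k - \hat{R} x_k),$$

which agrees with (1.3). Thus, our variance estimator $v_L(\hat{Y}_R)$ is identical to Binder's (1996) variance estimator, $v(z)$, noting that $\alpha = 1$. 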
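
The recipe of differentiating the estimator with respect to the weights and then plugging the derivatives into the variance formula for a total can be checked numerically for the ratio estimator. The sketch below is illustrative only: it assumes simple random sampling with equal weights N/n, uses a central finite difference in place of the analytic derivative, and all names and data values are hypothetical.

```python
# Sketch: z_k as the derivative of f(b) at b = d(s) for the ratio estimator,
# followed by v(z) under simple random sampling. Hypothetical data.

def ratio_total(b, y, x, X_total):
    """f(b) = X * sum(b_k y_k) / sum(b_k x_k), with b in place of d(s)."""
    num = sum(bk * yk for bk, yk in zip(b, y))
    den = sum(bk * xk for bk, xk in zip(b, x))
    return X_total * num / den


def numeric_z(f, d, h=1e-6):
    """Central-difference approximation to df/db_k at b = d, for k in s."""
    z = []
    for k in range(len(d)):
        up = list(d)
        up[k] += h
        down = list(d)
        down[k] -= h
        z.append((f(up) - f(down)) / (2.0 * h))
    return z


def srs_variance_of_total(z, N):
    """Usual SRS variance estimator of N * mean(z)."""
    n = len(z)
    zbar = sum(z) / n
    s2 = sum((zk - zbar) ** 2 for zk in z) / (n - 1)
    return N ** 2 * (1.0 / n - 1.0 / N) * s2


y = [12.0, 15.0, 9.0, 20.0, 14.0]
x = [10.0, 13.0, 8.0, 17.0, 12.0]
N, X_total = 50, 650.0
d = [N / len(y)] * len(y)          # SRS design weights for sampled units

z_numeric = numeric_z(lambda b: ratio_total(b, y, x, X_total), d)
x_hat = sum(dk * xk for dk, xk in zip(d, x))
r_hat = sum(dk * yk for dk, yk in zip(d, y)) / x_hat
z_analytic = [(X_total / x_hat) * (yk - r_hat * xk) for yk, xk in zip(y, x)]
v_z = srs_variance_of_total(z_numeric, N)
print(z_numeric, z_analytic, v_z)
```

Only derivatives for sampled units are needed, because units outside the sample carry zero weight in the variance formula. In this toy case the finite-difference values match the closed form in Example 2.1.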

Our derivation is simple and natural. On the other hand, in the standard linearization method, $\hat{\theta}$ is first expressed in terms of elementary components $\hat{Y}_1, \ldots, \hat{Y}_m$ as $g(\hat{Y})$ and the partial derivatives $\partial g(a)/\partial a_j$ are then evaluated at $a = Y$. It is interesting to note that all the components of $\hat{Y}$ use the same weights $d_k(s)$ and our approach always takes first derivatives of $f(b)$ with respect to $b_k$ at $b = d(s)$. It is not necessary to first express $\hat{\theta}$ in terms of elementary components. 


3. CALIBRATION ESTIMATORS 


The ratio estimator can be viewed as a calibration estimator, $\hat{Y}_R = \sum w_k(s) y_k$, with explicit weights $w_k(s) = (X/\hat{X}) d_k(s)$ and satisfying the calibration constraint $\sum w_k(s) x_k = X$. Calibration estimators of a total $Y$ of the form $\hat{Y}_c = \sum w_k(s) y_k$ with explicit weights $w_k(s)$ and satisfying the calibration constraints $\sum w_k(s) x_k = X$ are widely used, where $x_k = (x_{1k}, \ldots, x_{pk})^T$ and $X = (X_1, \ldots, X_p)^T$ is the vector of known totals of auxiliary variables $x_j$, $j = 1, \ldots, p$. In subsection 3.1 we consider the generalized regression (GREG) estimator and then study a general class of regression calibration estimators in subsection 3.2. Extension to estimators, $\hat{\theta}$, obtained as solutions of estimating equations is presented in subsection 3.3. The case of general calibration estimators is investigated in subsection 3.4. 


3.1 Generalized Regression Estimator 
The GREG estimator of total $Y$ is given by $\hat{Y}_c$ with calibration weights $w_k(s) = d_k(s) g_k(d(s))$, where 

$$g_k(d(s)) = 1 + (X - \hat{X})^T \left( \sum_l d_l(s) c_l x_l x_l^T \right)^{-1} c_k x_k, \tag{3.1}$$

with specified constants $c_k$ and $\hat{X} = \sum d_k(s) x_k$ (cf., Särndal et al. 1989). The ratio estimator $\hat{Y}_R$ is a special case with $p = 1$ (i.e., scalar $x_k$) and $c_k = x_k^{-1}$, and $g_k(d(s))$, given by (3.1), reduces to $X/\hat{X}$. 

The GREG estimator may be expressed as a differentiable function of estimated totals. Hence, the general theory of section 2 is applicable and it remains to evaluate $z_k = \partial f(b)/\partial b_k |_{b = d(s)}$, where $f(b) = \sum_l (b_l g_l(b)) y_l$ is obtained by replacing $d(s)$ by $b$ in the formula for $\hat{Y}_c$. Noting that $\partial A(b)^{-1}/\partial b_k = -A(b)^{-1} (\partial A(b)/\partial b_k) A(b)^{-1}$, where $A(b) = \sum_l b_l c_l x_l x_l^T$, we get 

$$\partial (b_k g_k(b))/\partial b_k = g_k(b) - x_k^T A(b)^{-1} (b_k c_k x_k) - (X - \hat{X}(b))^T A(b)^{-1} (c_k x_k x_k^T) A(b)^{-1} (b_k c_k x_k) \tag{3.2}$$

and for $l \ne k$ 

$$\partial (b_l g_l(b))/\partial b_k = -x_k^T A(b)^{-1} (b_l c_l x_l) - (X - \hat{X}(b))^T A(b)^{-1} (c_k x_k x_k^T) A(b)^{-1} (b_l c_l x_l). \tag{3.3}$$

It now follows from (3.2) and (3.3) that 

$$\partial f(b)/\partial b_k = g_k(b)\, e_k(b), \tag{3.4}$$

where 

$$e_k(b) = y_k - x_k^T \hat{B}(b) \tag{3.5}$$

with $\hat{B}(b) = A^{-1}(b) \left( \sum_l b_l c_l x_l y_l \right)$. Therefore, $z_k = \partial f(b)/\partial b_k |_{b = d(s)}$ reduces to 

$$z_k = g_k(d(s))\, e_k, \tag{3.6}$$

where $e_k = y_k - x_k^T \hat{B}$ with $\hat{B} = \hat{B}(d(s))$. 

The variance estimator of $\hat{Y}_c$ resulting from (3.6), namely $v(z)$, takes account of the g-weights, $g_k(d(s))$, unlike the standard linearization variance estimator (see e.g., Särndal et al. 1991, page 237). It agrees with the model-assisted variance estimator of Särndal et al. (1989). It also agrees with the jackknife linearization variance estimator when the latter is applicable (Yung and Rao 1996). 
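
The GREG g-weights and the linearized variable (g-weight times regression residual) can be sketched for a small sample. This is a hedged illustration only: it assumes a two-component auxiliary vector (an intercept and one covariate), unit constants c, equal design weights, and hypothetical data and known totals; the 2 x 2 solver is written by hand to keep the sketch free of dependencies.

```python
# Sketch: GREG g-weights and z_k = g_k * e_k with x_k = (1, x_k), c_k = 1.
# Data, weights and known totals are hypothetical.

def solve_2x2(a, rhs):
    """Solve a 2x2 linear system a * sol = rhs by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(a[1][1] * rhs[0] - a[0][1] * rhs[1]) / det,
            (a[0][0] * rhs[1] - a[1][0] * rhs[0]) / det]


def greg(b, y, x, c, X_tot):
    """Return (g, B, total) with b playing the role of the design weights."""
    A = [[sum(bk * ck * xk[i] * xk[j] for bk, ck, xk in zip(b, c, x))
          for j in range(2)] for i in range(2)]
    x_hat = [sum(bk * xk[i] for bk, xk in zip(b, x)) for i in range(2)]
    lam = solve_2x2(A, [X_tot[i] - x_hat[i] for i in range(2)])
    # A is symmetric, so (X - X_hat)' A^{-1} c_k x_k = c_k * x_k' lam.
    g = [1.0 + ck * (xk[0] * lam[0] + xk[1] * lam[1]) for ck, xk in zip(c, x)]
    B = solve_2x2(A, [sum(bk * ck * xk[i] * yk
                          for bk, ck, xk, yk in zip(b, c, x, y))
                      for i in range(2)])
    total = sum(bk * gk * yk for bk, gk, yk in zip(b, g, y))
    return g, B, total


y = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0]
x = [(1.0, 10.0), (1.0, 13.0), (1.0, 8.0), (1.0, 17.0), (1.0, 12.0), (1.0, 11.0)]
c = [1.0] * 6
d = [10.0] * 6                      # design weights, N = 60
X_tot = [60.0, 700.0]               # known totals of (1, x)

g, B, greg_total = greg(d, y, x, c, X_tot)
z = [gk * (yk - xk[0] * B[0] - xk[1] * B[1]) for gk, yk, xk in zip(g, y, x)]
calibrated = [sum(dk * gk * xk[i] for dk, gk, xk in zip(d, g, x)) for i in range(2)]


def dtotal_db(k, h=1e-6):
    """Finite-difference check of the derivative of the GREG total in b_k."""
    up = list(d)
    up[k] += h
    down = list(d)
    down[k] -= h
    return (greg(up, y, x, c, X_tot)[2] - greg(down, y, x, c, X_tot)[2]) / (2 * h)


z_check = [dtotal_db(k) for k in range(6)]
print(z, z_check, calibrated)
```

The finite-difference derivative is included only as a sanity check of the closed form; in practice the closed form would be used directly and then passed to whatever variance formula the design provides for a total.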


3.2 A General Class of Regression Calibration 
Weights 


We now turn to a general class of regression calibration weights of the form $w_k(s) = d_k(s) h_k(d(s))$ with 

$$h_k(d(s)) = 1 + (X - \hat{X})^T \hat{\Omega}^{-1} \Big( c_k x_k + \sum_{l \ne k} d_l(s) c_{kl} x_l \Big), \tag{3.7}$$

where the $ab$-th element of $\hat{\Omega}$ is given by $\hat{\Omega}_{ab} = \sum_{k=1}^{N} d_k(s) c_k x_{ak} x_{bk} + \sum_{k \ne l} d_k(s) d_l(s) c_{kl} x_{ak} x_{bl}$ for specified constants $c_k$ and $c_{kl}\,(= c_{lk})$. The class (3.7) covers the GREG estimator as well as the "optimal" linear regression estimator with $d_k(s) = (1/\pi_k) a_k(s)$. In the former case $c_{kl} = 0$, while the optimal linear regression estimator uses $c_k = (1 - \pi_k)/\pi_k$ and $c_{kl} = (\pi_{kl} - \pi_k \pi_l)/\pi_{kl}$, $k \ne l$, where $\pi_{kl}$ is the probability of including both elements $k$ and $l$ in the sample $s$ (Montanari 1998). 
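The calibration weights $w_k(s)$ may be rewritten as 

$$w_k(s) = d_k(s) + (X - \hat{X})^T \hat{\Omega}^{-1} \Big[ d_k(s) c_k x_k + \sum_{l \ne k} \tilde{d}_{kl}(s) \tilde{c}_{kl} x_l \Big], \tag{3.8}$$

where $\tilde{d}_{kl}(s) = d_k(s) d_l(s) / E[d_k(s) d_l(s)]$, $\tilde{c}_{kl} = c_{kl} E[d_k(s) d_l(s)]$ and 

$$\hat{\Omega}_{ab} = \sum_{k=1}^{N} d_k(s) c_k x_{ak} x_{bk} + \sum_{k \ne l} \tilde{d}_{kl}(s) \tilde{c}_{kl} x_{ak} x_{bl}.$$

Note that $E\, d_k(s) = 1$ and $E\, \tilde{d}_{kl}(s) = 1$. If $d_k(s) = (1/\pi_k) a_k(s)$ then $\tilde{d}_{kl}(s)$ reduces to $\tilde{d}_{kl}(s) = a_k(s) a_l(s)/\pi_{kl}$, and $\tilde{c}_{kl} = (\pi_{kl} - \pi_k \pi_l)/(\pi_k \pi_l)$. We can regard the calibration estimator $\hat{Y}_c$ resulting from (3.8) as a function of totals, by expressing a quadratic form as a total of synthetic variables (Sitter and Wu 2002). Therefore, we can use the method of section 2 and write $\hat{Y}_c = f(d^{(1)}(s), d^{(2)}(s), y) = \sum d_k(s) h_k(d^{(1)}(s), d^{(2)}(s)) y_k$, where $d^{(1)}(s) = d(s)$ and $d^{(2)}(s)$ is the vector of elements $\tilde{d}_{kl}(s)$, $k < l$, arranged in a sequence. Now, following the derivation of (2.3), we get 

$$\hat{Y}_c - Y \approx \sum_{k} Z_k (d_k(s) - 1) + \sum_{k < l} Z_{kl} (\tilde{d}_{kl}(s) - 1), \tag{3.9}$$

where 

$$Z_k = \partial f(b^{(1)}, b^{(2)}, y)/\partial b_k \big|_{b^{(1)} = 1,\, b^{(2)} = 1}, \qquad Z_{kl} = \partial f(b^{(1)}, b^{(2)}, y)/\partial b_{kl} \big|_{b^{(1)} = 1,\, b^{(2)} = 1},$$

$b^{(1)} = b = (b_1, \ldots, b_N)^T$ and $b^{(2)}$ is the vector of arbitrary real numbers $b_{kl}$, $k < l$, arranged in the same order as the elements $\tilde{d}_{kl}(s)$ in $d^{(2)}(s)$. Using (3.9), a variance estimator of $\hat{Y}_c$ is approximately given by the variance estimator of $\sum_k d_k(s) Z_k + \sum_{k<l} \tilde{d}_{kl}(s) Z_{kl}$, denoted by $v(Z^{(1)}, Z^{(2)})$. 

Since $v(Z^{(1)}, Z^{(2)})$ involves the unknown values $Z_k$ and $Z_{kl}$, we replace $Z_k$ by $z_k = \partial f(b^{(1)}, b^{(2)}, y)/\partial b_k |_{b^{(1)} = d^{(1)}(s),\, b^{(2)} = d^{(2)}(s)}$ and $Z_{kl}$ by $z_{kl} = \partial f(b^{(1)}, b^{(2)}, y)/\partial b_{kl} |_{b^{(1)} = d^{(1)}(s),\, b^{(2)} = d^{(2)}(s)}$ to get $v(z^{(1)}, z^{(2)})$. Unfortunately, the variance estimator $v(z^{(1)}, z^{(2)})$ involves third order and fourth order moments $E[d_k(s) d_l(s) d_q(s)]$ and $E[d_k(s) d_l(s) d_q(s) d_r(s)]$ in addition to the second moments $E[d_k(s) d_l(s)]$, whereas the variance estimator for the generalized regression estimator requires only the second moments. In particular, if $d_k(s) = (1/\pi_k) a_k(s)$ we require third and fourth order inclusion probabilities $\pi_{klq}$ and $\pi_{klqr}$ as well as the second order inclusion probabilities $\pi_{kl}$. 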


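The calculation of $z_k$ and $z_{kl}$ involves the derivatives $\partial [b_l h_l(b^{(1)}, b^{(2)})]/\partial b_k$ for $l = k$ and $l \ne k$ and the derivatives $\partial [b_l h_l(b^{(1)}, b^{(2)})]/\partial b_{kl}$ for $l = k$ and $l \ne k$. After simplification, we get 

$$z_k = \left[ 1 + (X - \hat{X})^T \hat{\Omega}^{-1} c_k x_k \right] e_k^*$$

and 

$$z_{kl} = (X - \hat{X})^T \hat{\Omega}^{-1} \tilde{c}_{kl} x_l\, e_k^*,$$

where 

$$e_k^* = y_k - x_k^T \hat{B}^*$$

with $\hat{B}^* = \hat{\Omega}^{-1} \big( \sum_k d_k(s) c_k x_k y_k + \sum_{k \ne l} \tilde{d}_{kl}(s) \tilde{c}_{kl} x_k y_l \big)$. Note that the customary Taylor linearization variance estimation uses $v(e^*)$, while $v(z^{(1)}, z^{(2)})$ would involve the residuals $e_k^*$ as well as the g-weights $1 + (X - \hat{X})^T \hat{\Omega}^{-1} c_k x_k$ and $(X - \hat{X})^T \hat{\Omega}^{-1} \tilde{c}_{kl} x_l$. If $c_{kl} = 0$ for all $k \ne l$, then $z_{kl} = 0$ and $v(z^{(1)}, z^{(2)})$ reduces to $v(z)$ with $z_k$ given by (3.6). Thus the GREG result of subsection 3.1 is a special case. 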


3.3 Estimating Equations 


We now turn to a vector parameter $\theta = (\theta_1, \ldots, \theta_p)^T$ defined either explicitly or implicitly as the solution to the "census" estimating equations $S(\theta) = \sum_{k=1}^{N} u_k(\theta) = 0$. A calibration estimator $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_p)^T$ with GREG calibration weights $w_k(s) = d_k(s) g_k(d(s))$ is obtained as the solution to the sample estimating equations: 

$$\hat{S}(\theta) = \sum w_k(s) u_k(\theta) = 0, \tag{3.10}$$

where $u_k(\theta)$ and $\hat{S}(\theta)$ are $(p \times 1)$ vectors (Binder 1983). For example, for logistic regression with scalar $\theta$, we have $u_k(\theta) = (y_k - p_k(\theta)) a_k$, where $p_k(\theta) = P(y_k = 1 \mid a_k) = \exp(\theta a_k)/(1 + \exp(\theta a_k))$ and $a_k$ is the predictor variable. Note that $\hat{\theta}$, in this case, is the implicit solution to (3.10) and is obtained iteratively using the Newton-Raphson or Fisher scoring method. 
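
The iterative solution of the weighted estimating equation for scalar-parameter logistic regression can be sketched with a plain Newton-Raphson loop. This is a minimal illustration under stated assumptions: the weights are simply given numbers standing in for calibration weights, the data are hypothetical, and no step-halving or other safeguard is included.

```python
import math

# Sketch: Newton-Raphson for sum_k w_k (y_k - p_k(theta)) a_k = 0 with
# scalar theta. Weights and data are hypothetical illustration values.


def score(theta, w, y, a):
    """Weighted estimating function evaluated at theta."""
    total = 0.0
    for wk, yk, ak in zip(w, y, a):
        pk = 1.0 / (1.0 + math.exp(-theta * ak))   # exp(t a) / (1 + exp(t a))
        total += wk * (yk - pk) * ak
    return total


def solve_logistic_theta(w, y, a, tol=1e-12, max_iter=50):
    """Return theta solving the weighted estimating equation."""
    theta = 0.0
    for _ in range(max_iter):
        s = score(theta, w, y, a)
        info = 0.0                                  # minus d(score)/d(theta)
        for wk, ak in zip(w, a):
            pk = 1.0 / (1.0 + math.exp(-theta * ak))
            info += wk * pk * (1.0 - pk) * ak * ak
        step = s / info
        theta += step
        if abs(step) < tol:
            break
    return theta


w = [10.0, 12.0, 8.0, 11.0, 9.0, 10.0]
y = [0, 0, 1, 0, 1, 1]
a = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
theta_hat = solve_logistic_theta(w, y, a)
print(theta_hat, score(theta_hat, w, y, a))
```

For this model the Newton-Raphson and Fisher scoring updates coincide, because the derivative of the estimating function does not involve the responses.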

The estimator of a ratio of totals $Y$ and $A = \sum a_k$ is obtained as the explicit solution of (3.10) with $u_k(\theta) = y_k - \theta a_k$: $\hat{\theta} = \sum w_k(s) y_k / \sum w_k(s) a_k = \hat{Y}_c/\hat{A}_c$. In this case, $\hat{\theta}$ is a function of estimated totals and hence our method for functions of totals is applicable. It remains to evaluate $\partial f(b)/\partial b_k$, where $f(b) = \sum_l b_l g_l(b) y_l / \sum_l b_l g_l(b) a_l$. We have 

$$\partial f(b)/\partial b_k = \sum_l \left[ \partial (b_l g_l(b))/\partial b_k \right] \hat{A}(b)^{-1} (y_l - f(b) a_l),$$

where $\hat{A}(b) = \sum_l b_l g_l(b) a_l$. Now using (3.4) and (3.5), it is easy to verify that $z_k$ reduces to 

$$z_k = g_k(d(s))\, \hat{A}_c^{-1} e_k^*,$$

where 

$$e_k^* = u_k(\hat{\theta}) - x_k^T \hat{B}^*$$

with $\hat{B}^*$ obtained from $\hat{B}$ by changing $y_k$ to $u_k(\hat{\theta})$. Note that the residuals $e_k^*$ have the same form as the GREG residuals $e_k$ with $y_k$ replaced by $u_k(\hat{\theta})$. 

In general, the solution $\hat{\theta}$ to the estimating equations (3.10) may not be expressible as a function of estimated totals. We therefore follow Binder's (1983) approach and write the linearization estimator of the covariance matrix of $\hat{\theta}$ as 

$$v_L(\hat{\theta}) = [\hat{J}(\hat{\theta})]^{-1}\, \hat{\Sigma}_S(\hat{\theta})\, [\hat{J}^T(\hat{\theta})]^{-1}, \tag{3.11}$$

where $\hat{J}(\theta) = -\partial \hat{S}(\theta)/\partial \theta^T$ and $\hat{\Sigma}_S(\hat{\theta})$ is the estimated covariance matrix $v(\hat{S}(\theta)) = \hat{\Sigma}_S(\theta)$ evaluated at $\theta = \hat{\theta}$. Binder (1983) gave regularity conditions for the validity of (3.11). Noting that $\hat{S}(\theta)$ is a vector of estimated totals with GREG weights $d_k(s) g_k(d(s))$, it follows from (3.6) and (3.11) that 

$$v_L(\hat{\theta}) = v(z), \tag{3.12}$$

where 

$$z_k = [\hat{J}(\hat{\theta})]^{-1} g_k(d(s))\, e_k^* \tag{3.13}$$

with $e_k^* = (e_{1k}^*, \ldots, e_{pk}^*)^T$ and 

$$e_{jk}^* = u_{jk}(\hat{\theta}) - x_k^T \hat{B}_j^*, \quad j = 1, \ldots, p.$$

Further, $\hat{B}_j^*$ is obtained from $\hat{B}$ by changing $y_k$ to $u_{jk}(\hat{\theta})$, and $v(z)$ is the estimated covariance matrix of the vector of estimated totals $\hat{Z} = \sum d_k(s) z_k$, where $u_{jk}(\theta)$ is the $j$th element of $u_k(\theta)$. The result (3.12) agrees with the jackknife linearization variance estimator, $v_{JL}$, for stratified multistage sampling obtained by Rao, Yung and Hidiroglou (2002). 

The result (3.12)-(3.13) may also be obtained directly by writing $\hat{\theta}$ as $f(d(s))$ and evaluating $z_k = \partial f(b)/\partial b_k |_{b = d(s)}$. We denote $\hat{\theta}(b) = f(b)$ as the solution of $\sum_l (b_l g_l(b)) u_l(\theta) = 0$, i.e., 

$$\sum_l (b_l g_l(b))\, u_l(\hat{\theta}(b)) = 0. \tag{3.14}$$

We now take the derivative of (3.14) with respect to $b_k$ to get 

$$0 = \sum_{l=1}^{N} \left[ \partial (b_l g_l(b))/\partial b_k \right] u_l(\hat{\theta}(b)) + \sum_{l=1}^{N} (b_l g_l(b)) \left[ \partial u_l(\hat{\theta}(b))/\partial \hat{\theta}(b)^T \right] \partial \hat{\theta}(b)/\partial b_k. \tag{3.15}$$

Substituting (3.2) and (3.3) for $\partial (b_l g_l(b))/\partial b_k$ in (3.15), we obtain (3.13) after simplification. This result shows that our method is also directly applicable to general estimators $\hat{\theta}$ under Binder's (1983) regularity conditions. 


3.4 A General Class of Calibration Estimators 


The calibration weights, $w_k(s)$, associated with the GREG estimator $\hat{Y}_c$ may not always be nonnegative. To get around this difficulty, generalized raking ratio weights are often used. These weights are always nonnegative, but the method can lead to some extreme weights (Deville and Särndal 1992). 

The generalized raking weights belong to the class 

$$w_k(s) = d_k(s) F(x_k^T \hat{\lambda}) \tag{3.16}$$

with $F(a) = e^a$, where the Lagrange multiplier $\hat{\lambda}$ is determined by solving the calibration equations 

$$\sum w_k(s) x_k = \sum d_k(s) F(x_k^T \hat{\lambda}) x_k = X. \tag{3.17}$$

The GREG weights correspond to $F(a) = 1 + a$, in which case $\hat{\lambda} = (\sum d_k(s) x_k x_k^T)^{-1} (X - \hat{X})$. 

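In general, the calibration estimator $\hat{Y}_c = \sum w_k(s) y_k$ with weights $w_k(s)$ given by (3.16) may not be expressible as a function of estimated totals. We therefore follow Binder's (1983) approach and expand $F(x_k^T \hat{\lambda})$ around $\lambda$, where $\lambda$ denotes the probability limit of $\hat{\lambda}$. We get 

$$F(x_k^T \hat{\lambda}) \approx F(x_k^T \lambda) + f(x_k^T \lambda)\, x_k^T (\hat{\lambda} - \lambda), \tag{3.18}$$

where $f(a) = \partial F(a)/\partial a$. Further, by expanding the calibration equations (3.17) around $\lambda$, we obtain after simplification, 

$$\hat{\lambda} - \lambda \approx -\hat{Q}_\lambda^{-1} (\hat{S}_\lambda - X), \tag{3.19}$$

where $\hat{Q}_\lambda = \sum d_k(s) f(x_k^T \lambda) x_k x_k^T$ and $\hat{S}_\lambda = \sum d_k(s) F(x_k^T \lambda) x_k$. Note that both $\hat{Q}_\lambda$ and $\hat{S}_\lambda$ are of the form of estimated totals. Substituting (3.19) into (3.18) gives 

$$F(x_k^T \hat{\lambda}) \approx F(x_k^T \lambda) - f(x_k^T \lambda)\, x_k^T \hat{Q}_\lambda^{-1} (\hat{S}_\lambda - X). \tag{3.20}$$

Using the approximation (3.20) in (3.16), it follows that $\hat{Y}_c$ is approximated by a differentiable function of estimated totals. Hence, the general theory of section 2 is applicable and it remains to evaluate $z_k = \partial h(b)/\partial b_k |_{b = d(s)}$, where $h(b) = \sum_l b_l \tilde{g}_l(b) y_l$ with 

$$\tilde{g}_l(b) = F(x_l^T \lambda) - f(x_l^T \lambda)\, x_l^T \hat{Q}_\lambda(b)^{-1} (\hat{S}_\lambda(b) - X),$$

where $\hat{Q}_\lambda(b) = \sum_k b_k f(x_k^T \lambda) x_k x_k^T$ and $\hat{S}_\lambda(b) = \sum_k b_k F(x_k^T \lambda) x_k$. After simplification, and with $\lambda$ replaced by $\hat{\lambda}$, we get 

$$z_k = F(x_k^T \hat{\lambda})\, (y_k - x_k^T \hat{B}_\lambda), \tag{3.21}$$

where 

$$\hat{B}_\lambda = \Big( \sum d_k(s) f(x_k^T \hat{\lambda}) x_k x_k^T \Big)^{-1} \sum d_k(s) f(x_k^T \hat{\lambda}) x_k y_k.$$

Singh and Folsom (2000) obtained a similar result, using a somewhat different approach. 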

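The result (3.21) may also be obtained directly along the lines of (3.2) and (3.3) by writing $\hat{Y}_c$ as $f(d(s))$ and evaluating $z_k = \partial f(b)/\partial b_k |_{b = d(s)}$, where $f(b) = \sum_l b_l g_l(b) y_l$ with $g_l(b) = F(x_l^T \hat{\lambda}(b))$. We have 

$$\partial (b_k g_k(b))/\partial b_k = g_k(b) + b_k f(x_k^T \hat{\lambda}(b))\, x_k^T\, (\partial \hat{\lambda}(b)/\partial b_k), \tag{3.22}$$

and for $l \ne k$ 

$$\partial (b_l g_l(b))/\partial b_k = b_l f(x_l^T \hat{\lambda}(b))\, x_l^T\, (\partial \hat{\lambda}(b)/\partial b_k). \tag{3.23}$$

To evaluate $\partial \hat{\lambda}(b)/\partial b_k$, we take the derivatives of the calibration equations (3.17) with $d(s)$ replaced by $b$: $\sum_l b_l F(x_l^T \hat{\lambda}(b)) x_l - X = 0$. This gives 

$$0 = F(x_k^T \hat{\lambda}(b)) x_k + \sum_l b_l f(x_l^T \hat{\lambda}(b))\, x_l x_l^T\, (\partial \hat{\lambda}(b)/\partial b_k)$$

or 

$$\partial \hat{\lambda}(b)/\partial b_k = -\Big( \sum_l b_l f(x_l^T \hat{\lambda}(b))\, x_l x_l^T \Big)^{-1} F(x_k^T \hat{\lambda}(b))\, x_k. \tag{3.24}$$

Substituting (3.24) into (3.22) and (3.23), we get (3.21) after simplification. 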
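
The raking case can be sketched for a single auxiliary variable, where the calibration equation is a scalar equation in the multiplier and can be solved by Newton's method. The sketch below is illustrative only: it assumes the exponential form of F, a scalar auxiliary variable, equal design weights, and hypothetical data; it then compares the closed-form linearized variable with a finite-difference derivative of the calibrated total.

```python
import math

# Sketch: generalized raking weights d_k * exp(x_k * lam) with scalar x_k,
# and z_k = F(x_k lam)(y_k - x_k B_lam). Hypothetical data throughout.


def raking_lambda(b, x, X_total, max_iter=100):
    """Solve sum_k b_k exp(x_k lam) x_k = X_total for lam by Newton's method."""
    lam = 0.0
    for _ in range(max_iter):
        s = sum(bk * math.exp(xk * lam) * xk for bk, xk in zip(b, x))
        q = sum(bk * math.exp(xk * lam) * xk * xk for bk, xk in zip(b, x))
        step = (s - X_total) / q
        lam -= step
        if abs(step) < 1e-15:
            break
    return lam


def raking_total(b, y, x, X_total):
    """Calibrated total sum_k b_k exp(x_k lam(b)) y_k."""
    lam = raking_lambda(b, x, X_total)
    return sum(bk * math.exp(xk * lam) * yk for bk, xk, yk in zip(b, x, y))


y = [12.0, 15.0, 9.0, 20.0, 14.0, 11.0]
x = [10.0, 13.0, 8.0, 17.0, 12.0, 11.0]
d = [10.0] * 6
X_total = 700.0

lam_hat = raking_lambda(d, x, X_total)
F = [math.exp(xk * lam_hat) for xk in x]          # here f(a) = F(a) = exp(a)
weights = [dk * Fk for dk, Fk in zip(d, F)]
B_lam = (sum(dk * Fk * xk * yk for dk, Fk, xk, yk in zip(d, F, x, y))
         / sum(dk * Fk * xk * xk for dk, Fk, xk in zip(d, F, x)))
z = [Fk * (yk - xk * B_lam) for Fk, yk, xk in zip(F, y, x)]

h = 1e-5
z_check = []
for k in range(len(d)):
    up = list(d)
    up[k] += h
    down = list(d)
    down[k] -= h
    z_check.append((raking_total(up, y, x, X_total)
                    - raking_total(down, y, x, X_total)) / (2 * h))
print(lam_hat, z, z_check)
```

The finite-difference loop re-solves the calibration equation for every perturbed weight, which is why a closed form is preferable in a real survey with many sampled elements.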

Deville and Särndal (1992) showed that the asymptotic variance of $\hat{Y}_c$ for general $F(\cdot)$ is equivalent to the asymptotic variance of the GREG estimator which involves the “census” regression coefficient $B$. Using this result they obtained a variance estimator of $\hat{Y}_c$ for general $F(\cdot)$, by replacing $B$ by $\hat{B}_w = (\sum w_k(s) x_k x_k')^{-1} \sum w_k(s) x_k y_k$ where $w_k(s) = d_k(s) F(x_k'\hat{\lambda})$. The resulting $z_k$ agrees with our $z_k$ given by (3.21) if $f(a) = F(a)$, i.e., in the case of generalized raking weights. In the case of the GREG estimator we have $F(a) = 1 + a$, $f(a) = 1$ and $\hat{\lambda} = (\sum d_k(s) x_k x_k')^{-1}(X - \hat{X})$. It readily follows that $F(x_k'\hat{\lambda})$ reduces to the GREG g-weight $g_k(s) = 1 + x_k'(\sum d_k(s) x_k x_k')^{-1}(X - \hat{X})$ and $y_k - x_k'\hat{B}_f$ reduces to $y_k - x_k'\hat{B}$ with $\hat{B} = (\sum d_k(s) x_k x_k')^{-1} \sum d_k(s) x_k y_k$. Note that our $z_k$ in this case is different from the $z_k$ of Deville and Särndal (1992), but agrees with a commonly used $z_k$ (Särndal, Swensson and Wretman 1989).

Our method, along the lines of section 3.3, can be 
extended to implicitly defined estimators, $\hat{\theta}_c$, obtained as
solutions to estimating equations (3.10) based on the 
general calibration weights (3.16). Details are omitted for 
simplicity. 


4. TWO-PHASE SAMPLING 


We extend our method to two-phase sampling, assuming the estimator $\hat{\theta}$ of a parameter $\theta$ can be expressed as a differentiable function, $g(\hat{Y}, \hat{X}^{(1)})$, of estimated totals, $\hat{Y} = (\hat{Y}_1, \ldots, \hat{Y}_m)'$, from the second-phase sample and estimated totals $\hat{X}^{(1)} = (\hat{X}_1^{(1)}, \ldots, \hat{X}_p^{(1)})'$, from the first-phase sample only. Here $\hat{Y}_j = \sum d_k(s) y_{jk}$, $j = 1, \ldots, m$, $\hat{X}_j^{(1)} = \sum d_k^{(1)}(s_1) x_{jk}$, $j = 1, \ldots, p$, $d_k^{(1)}(s_1)$ denotes the first-phase design weight attached to the $k$th element with $d_k^{(1)}(s_1) = 0$ if $k$ is not in the first-phase sample $s_1$, and $d_k(s)$ is the final design weight attached to the $k$th element with $d_k(s) = 0$ if $k$ is not in the second-phase sample $s$. Further, the parameter $\theta = g(Y, X)$ with $Y = (Y_1, \ldots, Y_m)'$ and $X = (X_1, \ldots, X_p)'$ denoting the vectors of Y- and X-totals. For example, the two-phase ratio estimator, $\hat{Y}_{R2}$, is of the form $\hat{\theta} = g(\hat{Y}, \hat{X}, \hat{X}^{(1)})$:

$$\hat{Y}_{R2} = \frac{\hat{Y}}{\hat{X}} \hat{X}^{(1)} = \hat{R}\hat{X}^{(1)} = \frac{\sum d_k(s) y_k}{\sum d_k(s) x_k} \Big(\sum d_k^{(1)}(s_1) x_k\Big). \qquad (4.1)$$


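Note that $\hat{\theta} = g(\hat{Y}_1, \hat{Y}_2, \hat{X}_1^{(1)})$ with $\hat{Y}_1 = \hat{Y}$, $\hat{Y}_2 = \hat{X}$ and $\hat{X}_1^{(1)} = \hat{X}^{(1)}$. Also, $\theta = g(Y, X, X) = (Y/X)X = Y$.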

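For simplicity, assume a $g(\cdot)$ such that $N^{-1} g(\cdot)$ tends to a limit. Taylor linearization of $\hat{\theta} = g(\hat{Y}, \hat{X}^{(1)})$ around $(Y, X)$ gives

$$\hat{\theta} - \theta = g(\hat{Y}, \hat{X}^{(1)}) - g(Y, X) \approx (\partial g(a, a^{(1)})/\partial a)' \,|_{a=Y,\, a^{(1)}=X}\, (\hat{Y} - Y) + (\partial g(a, a^{(1)})/\partial a^{(1)})' \,|_{a=Y,\, a^{(1)}=X}\, (\hat{X}^{(1)} - X). \qquad (4.2)$$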


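Let $\hat{Y} = \sum b_k y_k$ and $\hat{X}^{(1)} = \sum b_k^{(1)} x_k$ for arbitrary real numbers $b = (b_1, \ldots, b_N)'$ and $b^{(1)} = (b_1^{(1)}, \ldots, b_N^{(1)})'$. Also, let $g(\hat{Y}, \hat{X}^{(1)}) = f(b, b^{(1)}, A_y, A_x) = f(b, b^{(1)})$, where $A_y$ is an $m \times N$ matrix with $k$th column $y_k = (y_{1k}, \ldots, y_{mk})'$, $k = 1, \ldots, N$, and $A_x$ is a $p \times N$ matrix with $k$th column $x_k = (x_{1k}, \ldots, x_{pk})'$, $k = 1, \ldots, N$. Now following the derivation of (2.3) and noting that $\hat{Y} = A_y d(s)$, $Y = A_y 1$, $\hat{X}^{(1)} = A_x d^{(1)}(s_1)$, $X = A_x 1$, it can be shown that (4.2) reduces to

$$\hat{\theta} - \theta \approx Z'(d(s) - 1) + Z^{(1)\prime}(d^{(1)}(s_1) - 1), \qquad (4.3)$$

where $d(s) = (d_1(s), \ldots, d_N(s))'$ and $d^{(1)}(s_1) = (d_1^{(1)}(s_1), \ldots, d_N^{(1)}(s_1))'$. Further, $Z = (Z_1, \ldots, Z_N)'$ with $Z_k = \partial f(b, b^{(1)})/\partial b_k \,|_{b=1,\, b^{(1)}=1}$ and $Z^{(1)} = (Z_1^{(1)}, \ldots, Z_N^{(1)})'$ with $Z_k^{(1)} = \partial f(b, b^{(1)})/\partial b_k^{(1)} \,|_{b=1,\, b^{(1)}=1}$. It follows from (4.3) that a variance estimator of $\hat{\theta}$ is approximately given by the variance estimator of the estimated total $\sum d_k(s) Z_k + \sum d_k^{(1)}(s_1) Z_k^{(1)} = \hat{Y}(Z) + \hat{X}^{(1)}(Z^{(1)})$. We denote the latter variance estimator as $v(Z, Z^{(1)})$. Now we replace $Z_k$ and $Z_k^{(1)}$ by $z_k = \partial f(b, b^{(1)})/\partial b_k \,|_{b=d(s),\, b^{(1)}=d^{(1)}(s_1)}$ and $z_k^{(1)} = \partial f(b, b^{(1)})/\partial b_k^{(1)} \,|_{b=d(s),\, b^{(1)}=d^{(1)}(s_1)}$ respectively, since $Z_k$ and $Z_k^{(1)}$ are unknown. This leads to a linearization variance estimator

$$v_L(\hat{\theta}) = v(z, z^{(1)}). \qquad (4.4)$$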


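We now consider the special case of a “double expansion” estimator $\hat{Y}(y) = \sum d_k(s) y_k$ with $d_k(s) = \pi_{1k}^{-1} \pi_{2k|1}^{-1}$ for $k \in s$ and the Horvitz-Thompson (H-T) estimator $\hat{X}^{(1)}(x) = \sum d_k^{(1)}(s_1) x_k$ with $d_k^{(1)}(s_1) = \pi_{1k}^{-1}$ for $k \in s_1$, where $\pi_{1k}$ is the probability of including element $k$ in $s_1$, and $\pi_{2k|1}$ is the conditional probability of including element $k$ in $s$ given $s_1$. In this case, an unbiased H-T type estimator of $V[\hat{Y}(y) + \hat{X}^{(1)}(x)]$ is

$$v(y, x) = \sum\sum_{k,l \in s_1} \frac{\pi_{1kl} - \pi_{1k}\pi_{1l}}{\pi_{1kl}} \frac{x_k}{\pi_{1k}} \frac{x_l}{\pi_{1l}} + 2 \sum_{k \in s} \sum_{l \in s_1} \frac{\pi_{1kl} - \pi_{1k}\pi_{1l}}{\pi_{1kl}\,\pi_{2k|1}} \frac{y_k}{\pi_{1k}} \frac{x_l}{\pi_{1l}} + \sum\sum_{k,l \in s} \frac{\pi_{1kl} - \pi_{1k}\pi_{1l}}{\pi_{kl}^*} \frac{y_k}{\pi_{1k}} \frac{y_l}{\pi_{1l}} + \sum\sum_{k,l \in s} \frac{\pi_{2kl|1} - \pi_{2k|1}\pi_{2l|1}}{\pi_{2kl|1}} \frac{y_k}{\pi_k^*} \frac{y_l}{\pi_l^*}, \qquad (4.5)$$

where $\pi_k^* = \pi_{1k}\pi_{2k|1}$, $\pi_{kl}^* = \pi_{1kl}\pi_{2kl|1}$, $\pi_{1kl}$ is the probability of including both elements $k$ and $l$ in $s_1$ and $\pi_{2kl|1}$ is the conditional probability of including both elements $k$ and $l$ in $s$ given $s_1$. A proof of (4.5) is given in the Appendix. The variance estimator (4.4) is obtained from (4.5) by changing $y_k$ and $x_k$ to $z_k$ and $z_k^{(1)}$ respectively.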
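The unbiasedness of an H-T type estimator such as (4.5) can be verified by brute force on a very small population. The Python sketch below is an illustration under stated assumptions, not the authors' code: it assumes simple random sampling at both phases, writes (4.5) as the sum of four double sums (first-phase x-term, cross term, first-phase y-term, second-phase y-term), and averages it over every possible pair $(s_1, s)$ using exact rational arithmetic. The function names and the toy population are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def v45_srs(y, x, s1, s, N):
    """H-T type variance estimator of Y_hat(y) + X1_hat(x), SRS at both phases."""
    n, m = len(s1), len(s)
    p1 = Fraction(n, N)                             # pi_{1k}
    p1_joint = Fraction(n * (n - 1), N * (N - 1))   # pi_{1kl}, k != l
    p2 = Fraction(m, n)                             # pi_{2k|1}
    p2_joint = Fraction(m * (m - 1), n * (n - 1))   # pi_{2kl|1}, k != l

    def pi1(k, l):
        return p1 if k == l else p1_joint

    def pi2(k, l):
        return p2 if k == l else p2_joint

    total = Fraction(0)
    for k in s1:                                    # first-phase x-term
        for l in s1:
            delta1 = pi1(k, l) - p1 * p1
            total += delta1 / pi1(k, l) * x[k] * x[l] / (p1 * p1)
    for k in s:                                     # cross term (k in s, l in s1)
        for l in s1:
            delta1 = pi1(k, l) - p1 * p1
            total += 2 * delta1 / (pi1(k, l) * p2) * y[k] * x[l] / (p1 * p1)
    for k in s:                                     # two y-terms over s
        for l in s:
            delta1 = pi1(k, l) - p1 * p1
            delta2 = pi2(k, l) - p2 * p2
            total += delta1 / (pi1(k, l) * pi2(k, l)) * y[k] * y[l] / (p1 * p1)
            total += delta2 / pi2(k, l) * y[k] * y[l] / (p1 * p2) ** 2
    return total

def check_unbiased(y, x, n, m):
    """Return (true variance, design expectation of the estimator) by enumeration."""
    N = len(y)
    estimates, variances = [], []
    for s1 in combinations(range(N), n):
        for s in combinations(s1, m):
            t = (sum(y[k] for k in s) / (Fraction(n, N) * Fraction(m, n))
                 + sum(x[k] for k in s1) / Fraction(n, N))
            estimates.append(t)
            variances.append(v45_srs(y, x, s1, s, N))
    count = len(estimates)                          # all pairs equally likely
    mean_t = sum(estimates) / count
    true_var = sum((t - mean_t) ** 2 for t in estimates) / count
    return true_var, sum(variances) / count

# Hypothetical population of N = 5 units with integer values.
true_var, mean_v = check_unbiased([3, 1, 4, 1, 5], [2, 7, 1, 8, 2], n=3, m=2)
```

Because the arithmetic is exact, `true_var` and `mean_v` can be compared with `==`. The enumeration requires $n \geq 2$ and $m \geq 2$ so that all joint inclusion probabilities are positive.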


Example 4.1 We illustrate the calculation of $v(z, z^{(1)})$ for the two-phase ratio estimator $\hat{Y}_{R2}$, given by (4.1), for the special case of simple random sampling at both phases: $s_1$ is a simple random sample of size $n$ and $s$ is a simple random subsample of size $m$ from $s_1$. In this case, $\pi_{1k} = n/N$ and $\pi_{2k|1} = m/n$. Further, it follows from (4.1) that for general two-phase design,

$$z_k = \frac{\hat{X}^{(1)}}{\hat{X}} (y_k - \hat{R} x_k) \qquad (4.6)$$

and

$$z_k^{(1)} = \hat{R} x_k. \qquad (4.7)$$


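Under simple random sampling at both stages, (4.6) and (4.7) reduce to $z_k = (\bar{x}^{(1)}/\bar{x}) e_k$ and $z_k^{(1)} = (\bar{y}/\bar{x}) x_k$ where $e_k = y_k - (\bar{y}/\bar{x}) x_k$, $\bar{y}$ and $\bar{x}$ are the second-phase sample means of $y$ and $x$ respectively, and $\bar{x}^{(1)}$ is the first-phase sample mean of $x$. Now substituting $z_k$ and $z_k^{(1)}$ for $y_k$ and $x_k$ in (4.5) and noting that $\pi_{1kl} = n(n-1)/[N(N-1)]$, $\pi_{2kl|1} = m(m-1)/[n(n-1)]$, $\pi_{1kk} = \pi_{1k}$ and $\pi_{2kk|1} = \pi_{2k|1}$, we get

$$v_L(\hat{Y}_{R2}) = N^2 \Big[ \Big(\frac{1}{m} - \frac{1}{N}\Big) \Big(\frac{\bar{x}^{(1)}}{\bar{x}}\Big)^2 s_e^2 + 2 \Big(\frac{1}{n} - \frac{1}{N}\Big) \frac{\bar{x}^{(1)}}{\bar{x}} \frac{\bar{y}}{\bar{x}}\, s_{ex} + \Big(\frac{1}{n} - \frac{1}{N}\Big) \Big(\frac{\bar{y}}{\bar{x}}\Big)^2 s_{1x}^2 \Big], \qquad (4.8)$$

where

$$s_e^2 = \frac{1}{m-1} \sum_{k \in s} (e_k - \bar{e})^2, \quad s_{1x}^2 = \frac{1}{n-1} \sum_{k \in s_1} (x_k - \bar{x}^{(1)})^2, \quad s_{ex} = \frac{n}{m(n-1)} \sum_{k \in s} (e_k - \bar{e})(x_k - \bar{x}^{(1)}),$$

and $\bar{e}$ is the second-phase sample mean of $e$. The formula (4.8) agrees with the formula derived by Rao and Sitter (1995). It is different from the customary formula (Sukhatme and Sukhatme 1970, page 176) which fails to make use of the full x-data $\{x_k, k \in s_1\}$. Rao and Sitter (1995) demonstrated through simulation that $v_L(\hat{Y}_{R2})$ is more efficient than the customary variance estimator. Also, $v_L(\hat{Y}_{R2})$ performed better in tracking the conditional mean squared error of $\hat{Y}_{R2}$; see Rao and Sitter (1995, section 3) for details of the simulation study.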
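The reduction in Example 4.1 can be illustrated numerically. The Python sketch below is a hedged illustration, not the authors' implementation: it assumes simple random sampling at both phases, computes the linearized variables $z_k = (\bar{x}^{(1)}/\bar{x}) e_k$ and $z_k^{(1)} = (\bar{y}/\bar{x}) x_k$, evaluates a closed form of the kind given in (4.8), and cross-checks it against a direct evaluation of the four double sums of the H-T type estimator (4.5). The function names, the argument layout (second-phase units listed first) and the data are hypothetical.

```python
def linearized_variables(y_s, x_s, x_rest):
    """z_k for k in s and z1_k for k in s1; units in s are listed first."""
    m, n = len(y_s), len(x_s) + len(x_rest)
    ybar, xbar = sum(y_s) / m, sum(x_s) / m
    xbar1 = (sum(x_s) + sum(x_rest)) / n
    R = ybar / xbar
    z = [xbar1 / xbar * (yk - R * xk) for yk, xk in zip(y_s, x_s)]
    z1 = [R * xk for xk in x_s + x_rest]
    return z, z1

def closed_form_48(y_s, x_s, x_rest, N):
    """Closed-form variance estimator for the two-phase ratio estimator."""
    m, n = len(y_s), len(x_s) + len(x_rest)
    ybar, xbar = sum(y_s) / m, sum(x_s) / m
    xbar1 = (sum(x_s) + sum(x_rest)) / n
    R, g = ybar / xbar, xbar1 / xbar
    e = [yk - R * xk for yk, xk in zip(y_s, x_s)]   # residuals; their mean is zero
    s2_e = sum(ek ** 2 for ek in e) / (m - 1)
    s_ex = n / (m * (n - 1)) * sum(ek * (xk - xbar1) for ek, xk in zip(e, x_s))
    s2_1x = sum((xk - xbar1) ** 2 for xk in x_s + x_rest) / (n - 1)
    return N ** 2 * ((1 / m - 1 / N) * g ** 2 * s2_e
                     + 2 * (1 / n - 1 / N) * g * R * s_ex
                     + (1 / n - 1 / N) * R ** 2 * s2_1x)

def direct_45(z, z1, N):
    """Direct double sums of (4.5) with y_k -> z_k and x_k -> z1_k under SRS.
    Units 0..m-1 form s; units 0..n-1 form s1."""
    m, n = len(z), len(z1)
    p1, p2 = n / N, m / n
    pi1 = lambda k, l: p1 if k == l else n * (n - 1) / (N * (N - 1))
    pi2 = lambda k, l: p2 if k == l else m * (m - 1) / (n * (n - 1))
    d1 = lambda k, l: pi1(k, l) - p1 * p1
    d2 = lambda k, l: pi2(k, l) - p2 * p2
    total = sum(d1(k, l) / pi1(k, l) * z1[k] * z1[l] / p1 ** 2
                for k in range(n) for l in range(n))
    total += sum(2 * d1(k, l) / (pi1(k, l) * p2) * z[k] * z1[l] / p1 ** 2
                 for k in range(m) for l in range(n))
    total += sum((d1(k, l) / (pi1(k, l) * pi2(k, l)) / p1 ** 2
                  + d2(k, l) / pi2(k, l) / (p1 * p2) ** 2) * z[k] * z[l]
                 for k in range(m) for l in range(m))
    return total

# Hypothetical data: N = 50, n = 10 first-phase units, m = 4 subsampled units.
N = 50
y_s = [12.0, 15.5, 9.8, 14.1]
x_s = [5.0, 6.2, 4.1, 5.9]
x_rest = [4.8, 5.5, 6.0, 5.1, 4.4, 5.7]
z, z1 = linearized_variables(y_s, x_s, x_rest)
```

With these inputs, `direct_45(z, z1, N)` and `closed_form_48(y_s, x_s, x_rest, N)` should coincide up to rounding, which is one way to confirm the algebra behind the closed form.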


CONCLUDING REMARKS 


We have presented a unified approach to deriving Taylor 
linearization variance estimators and applied it to a variety 
of problems. It leads directly to a variance estimator that 
has some desirable properties at least in a number of 
important special cases; in particular, approximate 
unbiasedness for the model variance of the estimator under 
an assumed model and validity under a conditional repeated 
sampling framework. It would be useful to investigate 
whether such desirable properties also hold for more 
complex cases such as the general class of calibration 
estimators (section 3.2), the estimators based on estimating 
equations (section 3.3) and two-phase sampling (section 4). 
We are currently investigating various extensions of our 
method, including variance estimation under imputation for 
item nonresponse and variance estimation from longitudinal 
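survey data.

APPENDIX

Unbiased Variance Estimator of $\hat{Y}(y) + \hat{X}^{(1)}(x)$

The variance of $\hat{Y}(y) + \hat{X}^{(1)}(x)$ is the sum of the variance of $\hat{Y}(y)$, the variance of $\hat{X}^{(1)}(x)$ and twice the covariance of $\hat{Y}(y)$ and $\hat{X}^{(1)}(x)$. An unbiased H-T type estimator of $V[\hat{Y}(y)]$ is given by Särndal, Swensson and Wretman (1991, chapter 9, page 348):

$$v[\hat{Y}(y)] = \sum\sum_{k,l \in s} \frac{\pi_{1kl} - \pi_{1k}\pi_{1l}}{\pi_{kl}^*} \frac{y_k}{\pi_{1k}} \frac{y_l}{\pi_{1l}} + \sum\sum_{k,l \in s} \frac{\pi_{2kl|1} - \pi_{2k|1}\pi_{2l|1}}{\pi_{2kl|1}} \frac{y_k}{\pi_k^*} \frac{y_l}{\pi_l^*}. \qquad (A.1)$$

An unbiased H-T type estimator of $V[\hat{X}^{(1)}(x)]$ is given by

$$v[\hat{X}^{(1)}(x)] = \sum\sum_{k,l \in s_1} \frac{\pi_{1kl} - \pi_{1k}\pi_{1l}}{\pi_{1kl}} \frac{x_k}{\pi_{1k}} \frac{x_l}{\pi_{1l}}. \qquad (A.2)$$

Further,

$$\mathrm{Cov}[\hat{Y}(y), \hat{X}^{(1)}(x)] = E\,\mathrm{Cov}_2[\hat{Y}(y), \hat{X}^{(1)}(x)] + \mathrm{Cov}[E_2(\hat{Y}(y)), E_2(\hat{X}^{(1)}(x))],$$

where $E_2$ and $\mathrm{Cov}_2$ denote conditional expectation and conditional covariance given $s_1$. Noting that $E_2(\hat{Y}(y)) = \hat{Y}^{(1)}(y)$, with $\hat{Y}^{(1)}(y) = \sum d_k^{(1)}(s_1) y_k$, $E_2(\hat{X}^{(1)}(x)) = \hat{X}^{(1)}(x)$ and $\mathrm{Cov}_2[\hat{Y}(y), \hat{X}^{(1)}(x)] = 0$, we get

$$\mathrm{Cov}[\hat{Y}(y), \hat{X}^{(1)}(x)] = \mathrm{Cov}[\hat{Y}^{(1)}(y), \hat{X}^{(1)}(x)].$$

An unbiased H-T type estimator of $2\,\mathrm{Cov}[\hat{Y}(y), \hat{X}^{(1)}(x)]$ is given by

$$2\,\mathrm{cov}[\hat{Y}(y), \hat{X}^{(1)}(x)] = 2 \sum_{k \in s} \sum_{l \in s_1} \frac{\pi_{1kl} - \pi_{1k}\pi_{1l}}{\pi_{1kl}\,\pi_{2k|1}} \frac{y_k}{\pi_{1k}} \frac{x_l}{\pi_{1l}}. \qquad (A.3)$$

The sum of (A.1), (A.2) and (A.3) equals (4.5).

Comment


PHILLIP S. KOTT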


The article addresses an impressive number of contexts, 
many of which have only recently been investigated in the 
literature, often by Professor Rao himself. I will have little 
to say here about estimating functions with calibration 
weights or two-phase sampling, except (mostly) to agree 
with the solutions advocated in the text. Instead, I will focus 
on three applications: the ratio estimator under simple 
random sampling discussed in the Introduction, the general 
class of regression calibration weights from section 3.2, and 
the general class of calibration estimators from section 3.4. 
I will end with a question about the linearization variance 
estimator in full Horvitz-Thompson form, which has 
bothered me for some time. 


The Ratio Under Simple Random Sampling 


Before beginning, let me confess to a certain skepticism about the general method proposed in section 2. I find that techniques of this sort work best when you already know what the answer is. Godambe and Thompson (1986) tried to use estimating functions to settle a controversy then surrounding the best variance estimator for the ratio under simple random sampling. Using the notation in the text, they demonstrated that $(\bar{X}/\bar{x}) v_L$ was the proper way to estimate the variance of a ratio estimator, $\hat{Y}_R = (X/\bar{x})\bar{y}$. Later, Binder (1996) corrected them. He showed that when done properly, $v_{JL} = (\bar{X}/\bar{x})^2 v_L$ is produced from estimating-function technology. It helped that he already knew that was the better answer.

As Demnati and Rao state, $v_{JL}$ has both good randomization (design) and model-based properties (here and hereafter I omit the qualifier, “under mild conditions which I assume to hold”). In fact, when $n/N$ is ignorably small, $v_{JL}$ has a relative bias of $O(1/n)$ as an estimator for the model variance of $\hat{Y}_R$. If the $y_k$ are uncorrelated, then this is not only true when $V_m(y_k) = \sigma^2 x_k$, as stated in the text, but, more generally, when $V_m(y_k) = \sigma_k^2$. Unfortunately, the result is less general when $n/N$ is not ignorably small. In that context, when the $y_k$ are uncorrelated and $V_m(y_k) = \sigma^2 x_k$, a more appropriate estimator for the model variance of $\hat{Y}_R$ is $v_{KB} = [(\bar{X}/\bar{x})^2 - (n/N)(\bar{X}/\bar{x})]\,[1 - (n/N)]^{-1} v_L$ (Kott and Brewer 2001). As an estimator for the randomization mean squared error of $\hat{Y}_R$, $v_{KB}$ has a relative bias of $O(1/\sqrt{n})$, just like $v_{JL}$ and $v_L$.
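The three variance estimators compared above differ only in the multiplier applied to $v_L$, which a few lines of Python make concrete. This is a sketch under assumptions: it takes $v_L = N^2 (1 - n/N) s_e^2 / n$ with residuals $e_k = y_k - (\bar{y}/\bar{x}) x_k$ as the customary linearization estimator (its definition lies outside this excerpt), and the function name and the label `v_KB` for the Kott and Brewer (2001) form are hypothetical.

```python
def ratio_variance_estimators(y, x, X_bar, N):
    """Return (v_L, v_JL, v_KB) for the ratio estimator under simple random sampling.

    y, x  : sample values
    X_bar : known population mean of x
    N     : population size
    """
    n = len(y)
    y_bar, x_bar = sum(y) / n, sum(x) / n
    R = y_bar / x_bar
    # Residual variance; the residuals sum to zero, so no centring is needed.
    s2_e = sum((yk - R * xk) ** 2 for yk, xk in zip(y, x)) / (n - 1)
    f = n / N                                  # sampling fraction
    v_L = N ** 2 * (1 - f) / n * s2_e          # assumed customary linearization form
    r = X_bar / x_bar
    v_JL = r ** 2 * v_L                        # jackknife linearization form
    v_KB = (r ** 2 - f * r) / (1 - f) * v_L    # form with the sampling-fraction correction
    return v_L, v_JL, v_KB

# Hypothetical sample of n = 5 from a population of N = 40.
y = [12.0, 15.5, 9.8, 14.1, 11.2]
x = [5.0, 6.2, 4.1, 5.9, 4.7]
v_L, v_JL, v_KB = ratio_variance_estimators(y, x, X_bar=5.6, N=40)
```

When $\bar{x} = \bar{X}$ the three values coincide, and when $n/N$ is negligible `v_KB` and `v_JL` are practically indistinguishable, which matches the discussion above.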

When simple random sampling is used in practice the sampling fraction is almost always small. Thus, $v_{JL}$ is an attractive variance/mean-squared-error estimator, and my criticism of Demnati and Rao for advocating it is mild.


A General Class of Regression Calibration Weights 


I would generalize the results of section 3.1 in a different manner than the authors do in section 3.2. Following Estevao and Särndal (2002), replace $c_k x_k$ in equation (3.1) with a vector $q_k$ having the same dimension as $x_k$. The rest of that section follows easily.

One choice for $q_k$ is

$$q_{(1)k} = \sum_{j \in U} (\pi_{kj} - \pi_k \pi_j)\, x_j / (\pi_k \pi_j),$$

the use of which results in a variant of the randomization-optimal regression estimator proposed by Tillé (1999). Observe that $[\sum d_k(s) q_{(1)k} x_k']^{-1} \sum d_k(s) q_{(1)k} y_k \approx [\mathrm{Var}(\hat{X})]^{-1} \mathrm{Cov}(\hat{X}, \hat{Y})$, where Var and Cov denote randomization-based properties.

Another choice, investigated indirectly by Demnati and Rao and likewise resulting in a variant of the randomization-optimal estimator, is

$$q_{(2)k} = \sum_{j \in s} (\pi_{kj} - \pi_k \pi_j)\, x_j / (\pi_{kj} \pi_j).$$

Since $q_{(2)k}$ is a function of the sample, the authors take us through the complications of section 3.2. This was only necessary for randomization-based inference. I would have gone a different way. Observe that $d_k(s) q_{(1)k} - d_k(s) q_{(2)k} = O_p(1/\sqrt{n})$. Replacing one for the other has an asymptotically ignorable effect on $w_k(s)$ (i.e., the relative difference is $O_p(1/n)$).


A General Class of Calibration Estimators 


A mild generalization of equation (3.16) allows calibration weights of the form,

$$w_k(s) = d_k(s) F(q_k'\hat{\lambda}),$$

where $q_k$ again has the same dimension as $x_k$. For convenience $F$ is assumed positive and twice differentiable around $q_k'\lambda$. Without loss of generality, one can assume $\lambda$ (the limit of $\hat{\lambda}$) is 0, and $f(0) = 1$. When $\hat{Y}_c = \sum w_k(s) y_k$ is a randomization consistent estimator, as I assume it is, $F(0)$ is equal to 1.

Paralleling the development in the text leads ultimately to

$$z_k = (y_k - x_k'\hat{B}_f)\, F(q_k'\hat{\lambda}),$$

where $\hat{B}_f = [\sum d_k(s) f(q_k'\hat{\lambda}) q_k x_k']^{-1} \sum d_k(s) f(q_k'\hat{\lambda}) q_k y_k$. The presence of the $f(\cdot)$ in the expression of $\hat{B}_f$ may be a bit of a surprise, but, it turns out, not a meaningful one in this context. For inference under the prediction model, $E_m(y_k) = x_k'\beta$, the derivative can be replaced by any constant without asymptotic consequence; $\hat{B}_f$ remains a model unbiased estimator for $\beta$. For randomization-based inference, since $q_k'\hat{\lambda} = O_p(1/\sqrt{n})$ and $F(0) = f(0) = 1$, $z_k$ would be unaffected asymptotically if $f(q_k'\hat{\lambda})$ were replaced by 1 or by $F(q_k'\hat{\lambda})$.

Things change, however, if we push the envelope a bit. Fuller, Loughin and Baker (1994) use calibration to adjust for unit nonresponse by treating sample response as a second phase of sampling. They assume that every element $k$ in the population has a Poisson probability of sample response, $\pi_{2k}$, which is independent of whether it is actually chosen for the sample. They further assume $\pi_{2k} = 1/(1 + x_k'\lambda)$, where $\lambda$ is unknown and implicitly estimated by $\hat{\lambda}$. Here we generalize that and assume $\pi_{2k} = 1/F(q_k'\lambda)$, where $F$ is known, positive, and twice differentiable. In practice, $q_k$ will likely be identical to $x_k$, but it may be reasonable to replace one or more components of $x_k$ with variables conjectured to be more strongly correlated with response/nonresponse.

Redefining $s$ as the respondent sample and $d_k(s)$ as $(1/\pi_{1k})$ when $k \in s$, 0 otherwise, the development of $z_k$ proceeds as before. The difference is that $f(q_k'\hat{\lambda})$ in $\hat{B}_f$ need no longer be asymptotically identical across the $k$. Thus, the term can matter even with a large sample.

Now $v(\hat{Y}_c) \approx v(\sum d_k(s) z_k)$, where $\sum d_k(s) z_k = \sum d_k(s) F(q_k'\hat{\lambda}) e_k$, with $e_k = y_k - x_k'\hat{B}_f$, is the double expansion estimator. Using $1/F(q_k'\hat{\lambda})$ for $\pi_{2k}$, the variance estimator for $\hat{Y}_c$ becomes (from equation (A.1) with $\pi_{2kj} = \pi_{2k}\pi_{2j}$ for $k \neq j$)

$$v(\hat{Y}_c) = \sum\sum_{k,j \in s} [(\pi_{1kj} - \pi_{1k}\pi_{1j})/\pi_{1kj}]\; d_k(s) F(q_k'\hat{\lambda}) e_k\; d_j(s) F(q_j'\hat{\lambda}) e_j + \sum_{k \in s} d_k(s) F(q_k'\hat{\lambda}) [F(q_k'\hat{\lambda}) - 1]\, e_k^2.$$

This differs from the variance estimator in Folsom and Singh (2000) mainly because those authors assume the original sample is chosen using a stratified multistage design employing with-replacement sampling in the first stage. That, among other things, annihilates the second summation on the right hand side.

Not only does $v(\hat{Y}_c)$ estimate the quasi-randomization mean squared error of $\hat{Y}_c$ (“quasi” because a response model is assumed), it also estimates the model variance of $\hat{Y}_c$. In fact, the relative bias of $v(\hat{Y}_c)$ under the prediction model, $E_m(y_k | x_k, q_k) = x_k'\beta$, is $O(1/n)$ when the $y_k$ are uncorrelated and $V_m(y_k | x_k, q_k) = x_k'\gamma$, where $\gamma$ (like $\beta$) need not be specified. Surprisingly, the second term in $v(\hat{Y}_c)$ provides the model-based correction I recommended for the ratio estimator under simple random sampling in the absence of nonresponse.


Does the ‘Plug-in’ Variance Estimator Really Work 
for the Full Horvitz-Thompson Form? 


As I warned parenthetically early on, I have omitted the key phrase, “under mild conditions which I assume to hold,” repeatedly in these comments. Now, I want to turn my attention to what may be one of those conditions. It is standard in variance estimation to replace population (or model) values with sample analogues since their difference is asymptotically ignorable. That is done, for example, by Demnati and Rao in equation (2.4) when they plug in $z_k$ for $Z_k$. The question I want to raise, and for which I do not know the answer, is this. Suppose one is estimating a total with a calibration estimator. The total is $O(N)$, and $O(n) = O(N)$. The estimator’s model variance and randomization mean squared error are also $O(n)$. Is it legitimate to plug in $z_k$ for $Z_k$, where $z_k - Z_k = O_p(1/\sqrt{n})$, when there are $n(n-1)/2$ terms in the Horvitz-Thompson (or Yates-Grundy) variance/mean-squared-error estimator? In most practical applications, this is a non-issue, because the variance estimator can be re-expressed with $O(n)$ terms. What if that is not the case?

Let me conclude these remarks by thanking Drs. Demnati and Rao for their stimulating article and Survey Methodology for both publishing it and allowing me to provide some comments.

Comment


BABUBHAI V. SHAH¹


This is an excellent paper that removes the mystery underlying Taylor linearization. Most data analysis applications use Horvitz-Thompson weights that are reciprocals of the probabilities of selection. The simplest prescription for deriving the linearization for an estimator $\hat{\theta}$ is as follows:

1. For each observation, create a new variable $z_i = \partial\hat{\theta}/\partial w_i$, where $w_i$ is the reciprocal of the selection probability for the $i$-th observation selected in the sample. In cases where the estimator $\hat{\theta}$ is defined implicitly through estimating equations, the derivative can be computed by differentiating the implicit equations.

2. Define the weighted total $T = \sum w_i z_i$.

3. Compute the variance $V$ of the total $T$ based on the sample design.

4. The variance $V$ is the approximate variance of the estimator $\hat{\theta}$.


If the parameter $\theta$ is a vector then the variable $z_i$ and the total $T$ are also vectors and $V$ is an approximate estimate of the variance covariance matrix of the estimator $\hat{\theta}$.
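The four steps above can be sketched in a few lines of Python. The sketch is illustrative only: it assumes a scalar estimator, obtains the derivatives in step 1 by central finite differences rather than analytically, and uses the usual single-stage with-replacement variance formula for step 3. The function names and the data are hypothetical.

```python
def linearize(estimator, w, h=1e-4):
    """Step 1: z_i = d(theta_hat)/d(w_i), approximated by central differences."""
    z = []
    for i in range(len(w)):
        up, dn = list(w), list(w)
        up[i] += h
        dn[i] -= h
        z.append((estimator(up) - estimator(dn)) / (2 * h))
    return z

def with_replacement_variance(w, z):
    """Steps 2 and 3: total T = sum w_i z_i and its variance estimate,
    assuming a single-stage with-replacement design."""
    n = len(w)
    t = [wi * zi for wi, zi in zip(w, z)]
    T = sum(t)
    V = n / (n - 1) * sum((ti - T / n) ** 2 for ti in t)
    return T, V

# Hypothetical sample: weights w_i and two variables y_i, x_i.
w = [20.0, 35.0, 15.0, 30.0, 25.0]
y = [4.0, 7.5, 3.2, 6.1, 5.0]
x = [2.0, 3.1, 1.8, 2.9, 2.2]

def ratio(weights):
    """theta_hat = sum w_i y_i / sum w_i x_i, a simple nonlinear estimator."""
    return (sum(wi * yi for wi, yi in zip(weights, y))
            / sum(wi * xi for wi, xi in zip(weights, x)))

z = linearize(ratio, w)                    # step 1
T, V = with_replacement_variance(w, z)     # steps 2 and 3; V is used in step 4
```

For the ratio, the derivative is available in closed form, $z_i = (y_i - \hat{\theta} x_i)/\sum w_i x_i$, so the numerical values can be checked directly; note also that $T$ is zero for this estimator, so only the spread of the $w_i z_i$ contributes to $V$.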

The steps (1) and (2) specified above produce the correct 
linearization in the following cases: 


a. Means, proportions, and ratio estimates.

b. Generalized linear regression models.

c. Predicted marginal for generalized linear model.

d. Estimate of the mean from regression imputed data.

e. Generalized linear regression models with calibrated weights.

f. Wilcoxon two sample rank sum test.

g. Estimates of coefficients and the hazard rate in Cox’s proportional hazard model.

h. Estimates of predicted marginal survival in Cox’s proportional hazard model.

i. Two-phase sample survey.

¹ Babubhai V. Shah, SAFAL Institute, Inc. E-mail: babushah@earthlink.net


The derivation in the step (1) is uniquely defined and does not contain the true value of the parameter $\theta$, and does not require substitution by the estimator $\hat{\theta}$.

The independence of step (3) for variance computation from the linearization in steps (1) and (2) is aptly demonstrated by the discussion on two-phase sampling in section 4. In most cases, one assumes a with-replacement sample design to estimate the variance of the total in step (3). Of course, a better estimate of the variance of the total may be obtained by using all the available information about the sample design. For the case of a two-phase design, step (1) can be performed by using Horvitz-Thompson weights for the phase one sampling, and treating the multipliers $m_i$ as data. The multiplier $m_i$ is equal to zero if the observation $i$ is not selected in phase two and is equal to the inverse of the conditional probability $\pi_{2i|1}$ otherwise. The resulting step (2) produces the same total as presented in the paragraph between equations (4.3) and (4.4). The subsequent discussion in section 4 describes the appropriate way to estimate the variance of this total for a two-stage sample design without replacement at each stage, and that calculation is independent of the linearization.

The steps (1) and (2) generate appropriate linearization in all known cases except where the estimator is not a continuous function of the weights $w_i$, e.g., a quantile.

Comment


CHRIS SKINNER


Linearization and replication approaches provide two 
broad classes of methods for variance estimation in surveys. 
Both have their relative advantages and it seems important 
to keep a place for both in the survey statistician’s ‘toolkit’. 
This paper deepens our understanding of linearization 
methods, proposes a general procedure to generate such 
variance estimators uniquely and provides valuable 
illustrations of this procedure in some important areas of 
application. 

A linearization method approximates the variance of a 
statistic of interest by the variance of a linear statistic, for 
which it is assumed a suitable variance estimator is 
available. The main issue here is the method used to 
determine the linear statistic. The standard approach 
assumes the statistic of interest may be expressed as a 
differentiable function of a vector of linear statistics (of 
fixed dimension) and uses Taylor series expansion to 
determine the approximation. The approach proposed in 
this paper applies to a more general class of sample- 
weighted statistics, illustrated by the complex examples in 
sections 3.2 and 4. The variance estimator is constructed by
differentiating the statistic with respect to the sample 
weights. The approach to linear approximation is closely 
related to methods based upon the influence function (e.g., 
equations 1.6 and 1.13) and the paper provides a helpful 
review of such methods in section 1. The authors note that 
it is not easy to verify the validity of such methods for 
statistics which are not smooth functions of (or a fixed 
number of) linear statistics and it would be interesting to 
know how far the proposed approach does indeed provide 
valid variance estimators for statistics, such as quantiles, 
which are not of this form. 

A key feature of the proposed approach, which ensures the unique construction of the variance estimator, is that derivatives are evaluated at values based on the achieved sample, without any initial evaluation of the approximating linear statistic at theoretical population values. Such initial evaluation may lead to non-uniqueness when auxiliary information is available, for example on a population mean, $\bar{X}$, and it is assumed that this value is equal to the limiting value of a corresponding sample statistic, $\bar{x}$. For statistics which are smooth functions of linear statistics, it appears that the variance estimator generated by the proposed method may also be constructed by conventional Taylor series methods, provided no initial simplification of the variance estimator takes place based on such assumptions about auxiliary information. Such construction may, however, be less clear-cut than for the proposed approach.

Assumptions employed by linearization methods differing from the proposed approach, such as that an auxiliary value $\bar{X}$ is the theoretical limiting value of a sample value $\bar{x}$, are based upon unconditional distributions and so it might be anticipated that the incorporation of such assumptions into a variance estimator might damage the method's conditional properties, especially with respect to statistics such as $\bar{x}$. The proposed procedure avoids dependence upon such assumptions and, by evaluating derivatives at achieved sample values, may be expected to track conditional properties more closely. (There appear to be parallels with Efron and Hinkley’s (1978) arguments in favour of the observed versus the expected information, although the context is rather different.)

The avoidance of dependence upon such assumptions 
may not only benefit the conditional properties of the 
proposed approach, but also protect the variance estimator 
against possible biasing effects of non-sampling errors. The 
auxiliary population information may differ from the 
limiting values of the corresponding sample statistics either 
because of non-response or non-coverage or because of 
discrepancies in the way the auxiliary variables are 
measured. In such circumstances, linearization methods 
differing from the proposed approach might lead to 
inconsistent variance estimation. For this reason, Fuller 
(2002, page 10) recommends the use of the g-weights in 
(3.6), as proposed, especially in the presence of 
nonresponse (page 15). With regards to the latter case, it 
seems worth noting that the validity of the proposed 
procedure does not appear to depend on the requirement 
that E(d(s)) = 1, provided 1 is replaced by E(d(s)) in the 
development in section 2. In particular, if s denotes unit 
respondents and non-response may be represented by 
Poisson sampling with unknown response probabilities then 
the proposed approach to variance estimation may still be 
consistent (when based on many standard variance 
estimators for linear statistics), even if d(s) is based only on 
sampling inclusion probabilities. 

Julia d’Arrigo and I have recently studied the properties of linearization variance estimators under nonresponse in simulation studies as part of the DACSEIS research project (www.dacseis.de) using data from the UK Labour Force Survey and the German Income and Expenditure Survey. We considered various calibration estimators under Poisson models for unit non-response which were ignorable given the calibrating variables, using standard variance estimators for linear statistics under stratified multi-stage sampling. We indeed found that nonresponse could lead to serious biases in the linearization variance estimators if they failed to take account of the g-weights for GREG estimation (section 3.1) or ignored the $F(x_k'\hat{\lambda})$ term in (3.21). Such biases were absent in the proposed approach.

We also investigated the alternative calibration estimators discussed in section 3.4. Deville and Särndal’s (1992) theoretical finding that the asymptotic variance of $\hat{Y}_c$ does not depend on the form of the function $F(\cdot)$ is based on the assumption that $\sum d_k(s) x_k$ is consistent for $X$. This assumption may not hold under various sources of non-sampling error, and is not required for the proposed approach. Hence, the appropriate approximate linear statistic (under departures from this assumption) is defined by (3.21) and the resulting variance estimator may depend on the form of $F(\cdot)$, even asymptotically. The standard linearization variance estimators in which $d_k(s) f(x_k'\hat{\lambda})$ in $\hat{B}_f$ is replaced by $d_k(s)$ or $w_k(s)$ may be inconsistent if these weights differ from $d_k(s) f(x_k'\hat{\lambda})$. Despite this theoretical fact, we observed little difference in our simulation study (for each of the functions, $1 + u$, $\exp(u)$, and $(1 - u)^{-1}$, used for $F(u)$) between the statistical properties of variance estimators based upon these three different choices of weight, $d_k(s) f(x_k'\hat{\lambda})$, $d_k(s)$ or $w_k(s)$, in the $\hat{B}_f$ vector in (3.21). Other studies might produce different findings.

A disadvantage of the linearization methods considered here compared to replication methods is the need for analytic differentiation. It would appear from the examples presented in this paper that the analytic differentiation involved in the proposed method is at least as straightforward as that in standard methods of Taylor series expansion of smooth functions of linear statistics. Nevertheless, in some applications, it may be advantageous to replace the human labour and possible human error arising with analytic differentiation by the use of ‘numerical differentiation’. The proposed approach might be described as an infinitesimal jackknife method since it perturbs the weight given to each sample observation by an infinitesimal amount to determine the approximating linear statistic. The derivative with respect to a weight in the proposed approach may be approximated numerically by a finite difference approach in which the statistic is recalculated with the weight perturbed by a finite amount for each observation in turn. This approach may be described as a jackknife method of linearization. A conventional approach would be to change each weight to zero in turn, perhaps standardizing for unequal weights as in (1.15). It does not seem essential to replace the original weight by zero and, in principle, each weight might be perturbed in some other way, for example by reducing it by a fixed amount $\delta$, smaller than the minimum value of $d_k(s)$. It seems likely that in many applications the variance estimator arising from such jackknife linearization will have very similar statistical properties to that constructed by the proposed approach. The choice between the estimators is likely to depend more on practical and computational considerations.
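The two perturbation schemes just described, reducing each weight by a small fixed amount or setting it to zero, can be compared in a short Python sketch. It is an illustration only: the statistic is a weighted ratio, the standardization divides the change in the statistic by the size of the weight change, and the function names and data are hypothetical.

```python
def perturbation_linearize(stat, w, delta=None):
    """Approximate z_k by [stat(w) - stat(w with w_k reduced)] / (size of reduction).

    delta=None sets each weight to zero in turn (delete-one jackknife form);
    otherwise each weight is reduced by the fixed amount delta.
    """
    base = stat(w)
    z = []
    for k in range(len(w)):
        step = w[k] if delta is None else delta
        perturbed = list(w)
        perturbed[k] -= step
        z.append((base - stat(perturbed)) / step)
    return z

# Hypothetical sample: design weights and two variables.
w = [20.0, 35.0, 15.0, 30.0, 25.0]
y = [4.0, 7.5, 3.2, 6.1, 5.0]
x = [2.0, 3.1, 1.8, 2.9, 2.2]

def ratio(weights):
    """Weighted ratio sum w_k y_k / sum w_k x_k."""
    return (sum(wk * yk for wk, yk in zip(weights, y))
            / sum(wk * xk for wk, xk in zip(weights, x)))

z_small = perturbation_linearize(ratio, w, delta=1e-4)   # near-infinitesimal change
z_delete = perturbation_linearize(ratio, w)              # weight set to zero
```

For this ratio the two versions are linked exactly: the delete-one value equals the derivative $(y_k - \hat{R} x_k)/\hat{X}$ multiplied by $\hat{X}/(\hat{X} - w_k x_k)$, so the two sets of values differ by a factor close to one when no single unit dominates the weighted total.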

My final comments are on terminology. There are practical reasons why it may be helpful to give the $z_k$ variable a name. In particular, this may be helpful for the practitioner who, for some complex statistics, has to employ two separate computational steps: (a) construction of the $z_k$ variable, for example using least squares routines when calibration weighting is used, and (b) use of standard variance estimation software for linear statistics. Different names are used for $z_k$ in the literature. Woodruff (1971) is usually acknowledged as the first paper in the survey sampling literature to draw attention to the role of $z_k$ and Andersson and Nordberg (1994) refer to $z_k$ as the Woodruff transformation. Woodruff and Causey (1976) refer to the approximating linear statistic as the linear substitute and $z_k$ as the substitute variable. In the more mainstream statistical literature, Davison and Hinkley (1997, page 46) refer to the $z_k$ as the empirical influence values. The term linearized variable, as used by Deville (1999), seems to me a simple and natural one. It is consistent with the use of the term linearized statistic to denote the approximating linear statistic and the term linearization for the method (which is a more suitable general term than Taylor series method for the broad class of approaches considered here).

Response from the Authors


1. INTRODUCTION 


We thank the three discussants, Phillip Kott, Babubhai 
Shah and Chris Skinner, for their insightful comments. Our 
rejoinder will attempt to address some of the issues raised 
by the discussants. The main aim of our paper was to study 
variance estimation for calibration estimators of population 
totals and nonlinear parameters, $\theta$, defined as solutions to
“census” estimating equations. We proposed a new Taylor
linearization approach that provides a unique variance
estimator, by avoiding initial evaluation of the linearized
statistic at the population values. We have also shown that
the variance estimator satisfies some desirable considerations,
such as approximate model unbiasedness and
validity under a conditional repeated sampling framework,
at least in a number of important cases. We have also shown 
that in two-phase sampling the variance estimator makes 
fuller use of the first phase sample data compared to 
traditional linearization variance estimators. 


Kott 


Kott’s discussion focused on three applications in our paper: (i) the jackknife linearization variance estimator, $v_{JL}$, of the ratio estimator $\hat{Y}_R = (\bar{y}/\bar{x})X$ in simple random sampling mentioned in section 1; (ii) the general class of regression calibration weights considered in section 3.2; (iii) the general class of calibration weights studied in section 3.4. Regarding (i), we noted the result that $v_{JL}$ is both asymptotically design unbiased and approximately model unbiased under the ratio model $E_m(y_k) = \beta x_k$ and $V_m(y_k) = \sigma^2 x_k$. Kott is correct in saying that the model bias may not be negligible if the sampling fraction, $n/N$, is not small. If $n/N$ is “ignorably small”, then model unbiasedness is, in fact, valid under a general variance function $V_m(y_k) = \sigma_k^2$, as noted by Kott and previously by Särndal et al. (1989). Under the ratio model, Kott proposes a more appropriate variance estimator, $v_{KB}$, that is model unbiased even if $n/N$ is not small and also valid under repeated sampling. The leading terms of $v_{KB}$ and $v_{JL}$ are identical, and our new approach captures only the leading term. It should be noted that model-unbiasedness of $v_{KB}$ depends on the validity of the assumption $V_m(y_k) = \sigma^2 x_k$.

Turning to (ii), we have shown in section 3.2 that if the general class of regression calibration weights, (3.7), are used, our approach leads to a variance estimator that is quite complex, involving third and fourth order moments of the design weights $d_k(s)$ with $d_k(s) = 0$ if the $k$th population element is not in the sample $s$. Kott proposes an attractive choice of weights obtained by replacing $c_k x_k$ in the GREG weight (3.1) with $q_{(1)k} = \sum_{j \in U} (\pi_{kj} - \pi_k \pi_j) x_j / (\pi_k \pi_j)$. This choice gives a variant of the “optimal” linear regression estimator and also avoids the complexities associated with the variance estimator based on the weights (3.7). This is an interesting and useful proposal, but $q_{(1)k}$ requires the knowledge of the x-vector for all the population elements, unlike (3.7) which depends only on the population total $X$; in practice, only $X$ may be available. Moreover, $q_{(1)k}$ depends on all the $N(N-1)/2$ joint inclusion probabilities $\pi_{kj}$, and hence computation of $q_{(1)k}$ may become cumbersome when the sampling design is based on unequal probability sampling without replacement.

Turning to (iii), Kott proposes a generalization of the 
calibration weights $w_k(s) = d_k(s) F(x_k'\hat{\lambda})$ in section 3.4 by 
replacing $x_k$ with “instrumental” variables $q_k$ having the 
same dimension as $x_k$. The corresponding z-variable in the 
variance estimator is similar to our (3.21), with $x_k x_k'$ 
and $x_k y_k$ in $\hat{B}$ changed to $q_k x_k'$ and $q_k y_k$, respectively, 
and $f(x_k'\hat{\lambda})$ changed to $f(q_k'\hat{\lambda})$. This is a useful 
extension. Kott notes that $\hat{B}$ remains a model unbiased 
estimator of $B$ if $f(q_k'\hat{\lambda})$ in $\hat{B}$ is replaced by any 
constant, and that the resulting $z_k$ is justified asymptotically 
under repeated sampling. However, Kott also notes that the 
term $f(q_k'\hat{\lambda})$ can matter even asymptotically if the 
calibration is used to adjust for unit nonresponse by treating 
sample response as a second phase of sampling. Using the 
result for two-phase sampling given in the Appendix, Kott 
then obtains a corresponding variance estimator. 
This extension to the nonresponse setting is also useful. It is 
indeed surprising that the second term in Kott’s variance estimator provides 
the model based correction he recommended for the ratio 
estimator $\hat{Y}_R$ under simple random sampling in the absence 
of nonresponse. 

Finally, Kott raises a question on the customary 
“plug-in” or “substitution” method used for variance 
estimation, as done in (2.4), where we plug in $z_k$ for $Z_k$. 
He asks if it is legitimate to plug in $z_k$ for $Z_k$, where 
$z_k = Z_k + O_p(1/\sqrt{n})$, when there are n(n - 1)/2 terms in the 
variance estimator $v(Z)$, as in the case of the Sen-Yates-Grundy 
variance estimator. We are not sure if we have 
understood his point correctly, but as long as $O_p(1/\sqrt{n})$ is 
uniform in k, say $a/\sqrt{n}$, then $v(z) = v(Z)$ + lower order 
terms. 


Shah 


Shah’s prescription (steps 1-4) clearly summarizes our 
method. Shah also notes that his steps 1 and 2, leading to 
our z-variable, produce the “correct” linearization in many 
other important applications not studied in our paper, 
including the Wilcoxon two sample rank sum test and 
estimation of regression coefficients and hazard rate in the 
Cox proportional hazard model. Shah’s unpublished paper 
(seen by courtesy of the author) spells out the z-variable for 
those applications, but using design weights. Extension to 
calibration weights should follow along the lines of section 
3. 

Shah makes an important point that step 3 for the 
computation of the variance estimate is independent of the 
linearization in steps 1 and 2 and that it is “aptly demonstrated 
by the discussion on two-phase sampling in section 
4”. He also notes that for two-phase sampling, linearization 
(step 1) can be performed using only the first-phase H-T 
weights $\pi_{1k}^{-1}$, by treating the second-phase weights, $\pi_{2k}^{-1}$ 
if $k \in s$ and 0 if k is not in the second-phase sample s, as 
data, and that the resulting step 2 produces the same 
approximation as given in our paper. We have verified this 
equivalence result for the two-phase ratio estimator in 
Example 4.1, and it is likely to hold generally. Shah’s 
proposal might simplify the implementation of step 1 to 
some extent. 


Skinner 


Skinner gives a clear appraisal of our linearization 
method and raises a number of important points: (i) terminology, 
(ii) possible extensions to non-smooth statistics 
such as quantiles, (iii) modifications of the method to 
handle unit nonresponse, (iv) possible use of numerical 
differentiation to calculate the $z_k$-variables. 

With regard to point (i), Skinner notes that it would be 
useful to give the $z_k$ variable a name since different names 
have been used in the literature. He suggests that the term 
linearized variable, as used by Deville (1999), is a simple 
and natural one since it is consistent with the usage of 
linearized statistic to denote the approximating linear 
statistic and linearization for the method. We are in 
agreement with Skinner’s suggestion. 

Turning to point (ii), a difficulty in extending our 
proposal to nonsmooth statistics $\hat{\theta} = f(d(s))$, such as 
quantiles, is that f(·) is not a differentiable function. A way 
to get around this difficulty is to approximate $\hat{\theta} - \theta$ by a 
differentiable function and then apply our method to the 
approximation. For example, in the case of the pth quantile $\theta$, 
Francisco and Fuller (1991) and Shao (1991) established 
the following asymptotic approximation valid for stratified 
multistage designs: 

$$\hat{\theta} - \theta \approx -\frac{1}{h(\theta)} \{\hat{F}_w(\theta) - p\},$$

where $\hat{F}_w(\theta) = \sum w_k(s) I(y_k \le \theta) / \sum w_k(s)$ is the calibration 
estimator of the distribution function F(·) at $\theta$, $F(\theta) = 
N^{-1} \sum I(y_k \le \theta) = p$, and $h(\theta)$ is the value of the density 
function h(·) at $\theta$. The definition of h(·) requires reference 
to a sequence of populations (Shao and Rao 1993) or 
to a superpopulation (Francisco and Fuller 1991). We used 
h(·) to denote the density rather than the customary f(·) 
because we used f(d(s)) to denote the estimator $\hat{\theta}$. Now, 
suppose $w_k(s) = d_k(s) g_k(d(s))$, where $g_k(d(s))$ is the 
GREG weight given by (3.1). We can then use (3.2) and 
(3.3) to get the linearized variable $z_k$ from the above 
approximation to $\hat{\theta} - \theta$, by replacing $h(\theta)$ with a suitable 
estimator $\hat{h}(\hat{\theta})$; for example the kernel-based estimator of 
h(·) used by Berger and Skinner (2003). Similarly, one can 
apply the method to general calibration weights, $w_k(s)$, 
using the results of section 4. Variance estimators of a low 
income proportion, say $\theta = F(t/2)$ where t is the median 
income, can also be obtained using the asymptotic approximation 
for $\hat{\theta} - \theta$ developed by Shao and Rao (1993). Berger 
and Skinner (2003) studied variance estimation for a low 
income proportion when generalized raking ratio weights, 
$w_k(s)$, are used. We can apply the results in section 3.2 to 
this case, and the resulting linearized variable $z_k$ will 
account for the calibration. Also, it will be different from 
the Deville z-variable (10) in Berger and Skinner (2003). 
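
The quantile approximation above can be illustrated with a minimal Python sketch. It is an illustration only, not the procedure of the paper: it uses plain design weights (no GREG or raking adjustment, so the results (3.2) and (3.3) are not applied), a Gaussian kernel with a hand-picked bandwidth as a stand-in for the kernel-based estimator of h(·), and $\hat{F}_w(\hat{\theta})$ in place of p so that the weighted linearized variables sum to zero. All function names and data are hypothetical.

```python
import math


def weighted_quantile(y, w, p):
    """Return the smallest y whose weighted cumulative share reaches p."""
    total = sum(w)
    cumulative = 0.0
    for value, weight in sorted(zip(y, w)):
        cumulative += weight
        if cumulative / total >= p:
            return value
    return max(y)


def weighted_cdf(y, w, point):
    """Weighted estimate of the distribution function at `point`."""
    return sum(wk for yk, wk in zip(y, w) if yk <= point) / sum(w)


def kernel_density(y, w, point, bandwidth):
    """Weighted Gaussian-kernel estimate of the density h(.) at `point`."""
    norm = math.sqrt(2.0 * math.pi)
    total = 0.0
    for yk, wk in zip(y, w):
        u = (point - yk) / bandwidth
        total += wk * math.exp(-0.5 * u * u) / norm
    return total / (sum(w) * bandwidth)


def linearized_quantile_variable(y, w, p, bandwidth):
    """Return (theta_hat, z), z_k = -(I(y_k <= theta_hat) - F_hat) / (h_hat * sum(w))."""
    theta_hat = weighted_quantile(y, w, p)
    f_hat = weighted_cdf(y, w, theta_hat)
    h_hat = kernel_density(y, w, theta_hat, bandwidth)
    n_hat = sum(w)
    z = [-((1.0 if yk <= theta_hat else 0.0) - f_hat) / (h_hat * n_hat) for yk in y]
    return theta_hat, z


# Hypothetical data: five sampled units with equal design weights.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [2.0, 2.0, 2.0, 2.0, 2.0]
theta_hat, z = linearized_quantile_variable(y, w, p=0.5, bandwidth=1.0)
print(theta_hat, z)
```

The values $z_k$ would then be passed to whatever variance formula suits the design, in place of the study variable.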

The modification suggested in point (iii) to handle unit 
nonresponse is very important, and it broadens the 
applicability of our method. As noted by Skinner, Kott and 
Fuller (2002), it is important to retain the g-weights in 
variance estimation whenever the limiting values of the 
estimators $\hat{X}$ differ from the corresponding control totals X, 
as in the case of non-response or non-coverage. Our method 
automatically accounts for the g-weights and may lead to 
consistent variance estimators in such cases. Empirical 
results of Skinner with d’Arrigo in this context are very 
interesting. The case of variance estimators for alternative 
calibration estimators, studied in section 3.4, relative to 
customary variance estimators that replace $d_k(s) f(x_k'\hat{\lambda})$ in 
the expression for $\hat{B}$ by $d_k(s)$ or $w_k(s)$ needs further study, 
as noted by Skinner. 

It may be noted that unit nonresponse is typically treated 
as second phase sampling (e.g., Poisson sampling with 
unknown response probabilities) and Skinner notes that our 
method may lead to consistent variance estimators even 
when the estimators are based only on the sampling 
inclusion probabilities. However, control totals X are 
needed to get valid estimators of the total Y, under some 
assumptions on the response probabilities (Fuller 2002, 
equation (8.4)). We have extended our method to handle 
weight adjustment for unit nonresponse and imputation for 
item nonresponse when control totals are not available, 
assuming uniform response within classes (Demnati and 
Rao 2002). The resulting variance estimators are naturally 
more complex compared to Skinner’s modification for unit 
nonresponse in the presence of control totals. 

Turning to point (iv) on the possible use of numerical 
differentiation to calculate the linearized variables $z_k$, 
Woodruff and Causey (1976) used such a method to 
calculate the derivatives $\partial g(a)/\partial a_j |_{a=\hat{Y}}$ given in (1.4) when 
$\hat{\theta} = g(\hat{Y})$. Skinner proposes perturbing each weight $d_k(s)$ in 
turn and then recalculating $\hat{\theta}$; for example, by replacing it 
by a fixed amount $\delta$ smaller than the minimum value of 
$d_k(s)$, $k \in s$. He conjectures that the proposed approach 
should lead to variance estimators very similar to those 
obtained through analytical differentiation. It would be 
useful to study the statistical properties of the proposed 
approach as an alternative to analytic differentiation of f(d(s)) with respect 
to the weights $d_k(s)$. 
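
As a rough illustration of numerical differentiation with respect to the weights, the following Python sketch compares a forward-difference approximation with the analytic derivative for the ratio estimator $\hat{Y}_R = X \sum d_k y_k / \sum d_k x_k$. The forward-difference scheme, the step size and the data are assumptions made for illustration; they are not Skinner's exact perturbation rule, which is not spelled out here.

```python
def ratio_estimator(d, y, x, x_total):
    """Ratio estimator X * sum(d*y) / sum(d*x) viewed as a function of the weights d."""
    y_hat = sum(dk * yk for dk, yk in zip(d, y))
    x_hat = sum(dk * xk for dk, xk in zip(d, x))
    return x_total * y_hat / x_hat


def numerical_linearized_variables(f, d, delta=1e-6):
    """Forward-difference approximation of df/dd_k, perturbing one weight at a time."""
    base = f(d)
    z = []
    for k in range(len(d)):
        perturbed = list(d)
        perturbed[k] += delta
        z.append((f(perturbed) - base) / delta)
    return z


def analytic_linearized_variables(d, y, x, x_total):
    """Analytic derivative: z_k = (X / X_hat) * (y_k - R_hat * x_k)."""
    y_hat = sum(dk * yk for dk, yk in zip(d, y))
    x_hat = sum(dk * xk for dk, xk in zip(d, x))
    r_hat = y_hat / x_hat
    return [x_total / x_hat * (yk - r_hat * xk) for yk, xk in zip(y, x)]


# Hypothetical sample of four units with equal design weights.
d = [10.0, 10.0, 10.0, 10.0]
y = [3.0, 5.0, 4.0, 8.0]
x = [2.0, 4.0, 3.0, 6.0]
x_total = 160.0

z_num = numerical_linearized_variables(lambda dd: ratio_estimator(dd, y, x, x_total), d)
z_ana = analytic_linearized_variables(d, y, x, x_total)
print(max(abs(a - b) for a, b in zip(z_num, z_ana)))
```

For a smooth estimator such as this one the two sets of values agree closely; how the numerical version behaves for less smooth statistics is exactly the open question raised above.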

We hope the discussions by Kott, Shah and Skinner will 
stimulate further work on the approach to variance estima- 
tion presented in our paper. 


Weighting Sample Data Subject to Independent Controls 


CARY T. ISAKI, JULIE H. TSAY and WAYNE A. FULLER 


ABSTRACT 


In the U.S. Census of Population and Housing, a sample of about one-in-six of the households receives a longer version 
of the census questionnaire called the long form. All others receive a version called the short form. Raking, using 
selected control totals from the short form, has been used to create two sets of weights for long form estimation: one for 
individuals and one for households. We describe a weight construction method based on quadratic programming that 
produces household weights such that the weighted sums for individual characteristics and for household characteristics 
agree closely with selected short form totals. The method is broadly applicable to situations where weights are to be 
constructed to meet both size bounds and sum-to-control restrictions. Application to the situation where the controls are 
estimates with an estimated covariance matrix is described. 


KEY WORDS: Raking; Regression; Quadratic programming; Coverage adjustment; Integer weights; Weighting area. 


1. INTRODUCTION 


Given the availability of known characteristic totals, it is 
common among survey practitioners to use such in- 
formation in estimators of the post stratified, ratio and 
regression type. The known characteristic totals are some- 
times called independent controls because they are derived 
outside of the survey situation. Use of independent controls 
tends to reduce the variance of most estimates. Independent 
controls also often compensate for coverage problems in 
surveys. See Deville and Särndal (1992) and Fuller (2002). 

The U.S. decennial census utilizes a sample for the 
measurement of selected characteristics. The questionnaire 
for these characteristics is called the long form and the 
sample for the long form consists of a random sample of 
addresses. The long form questionnaire requests information 
that is asked of all individuals (called short form infor- 
mation) plus information on a set of additional charac- 
teristics. In previous Censuses, raking to controls based on 
short form information was used to construct weights for the 
long form sample. Two sets of sample weights were created, 
one for person characteristics and one for housing unit 
characteristics. 

The set of categories used for person weighting was a 
classification of individuals by race, Hispanic origin, age 
and sex, family type, and household size. For households, 
the categories were the cross classification of race by 
Hispanic-origin-of-householder by tenure by household type 
and size. In the 1990 Census long form weighting process, 
persons and housing units were each classified by four sets 
of classifications for raking in four dimensions. When 
raking was completed, the long form sample weights were 
converted to integers. Integer weights are desirable because, 
unlike real weights, integer weights provide arithmetically 
consistent totals of integral characteristics. For details, see 
Schindler, Griffin and Swan (1992). 

Long form weighting using short form census infor- 
mation is a part of the Canadian Census of population and 
housing. Unlike the procedure used by the U.S. Census 
Bureau (USCB), the procedure used at Statistics Canada 
constructs a single set of household weights using regres- 
sion estimation. See Bankier, Houle and Luc (1997). Should 
the initial weights generated by the regression procedure 
exceed prescribed bounds, collapsing of cells defining ex- 
planatory variables is carried out. Linear dependencies and 
near linear dependencies among the explanatory variables 
are also removed by eliminating variables. See Bankier, 
Rathwell and Majkowski (1992). 

Lemaitre and Dufour (1987) used a generalized least 
squares estimator (GLS) to construct weights meeting 
person and household constraints. Alexander (1987) con- 
siders a procedure for constructing household weights in the 
census setting. One of his distance functions is similar to the 
one used in this paper. 

The use of quadratic programming to compute regression 
weights in the survey context was suggested by Husain 
(1969). An application of quadratic programming (QP) in a 
Census environment is that in Isaki, Ikeda, Tsay and Fuller 
(2000) where household weights for Census households 
were obtained using person totals as controls. Motivation for 
the use of various distance functions can be found in these 
two papers and in Deville and Särndal (1992) who discuss a 
general class of estimators called calibration estimators. 
Fuller, Loughin and Baker (1994) consider a regression 
weight generation procedure that is modified so that all 
weights are positive and very large weights are made 
smaller than the corresponding least squares weight. 
Jayasuriya and Valliant (1996) also consider a restricted 
regression. Fuller (2002) is a review of regression estimation. 

Our proposed long form weighting method is a type of 
regression estimation and, like the Statistics Canada 
approach, provides a single set of household weights that 
maintain given independent controls. We generate house- 
hold weights using quadratic programming with the re- 
strictions that the weights fall within a specified range and 
that the weights maintain control totals. In the following, we 
refer to the suggested method as the quadratic programming 
method or QP. 


2. THE QUADRATIC PROGRAMMING 
METHOD 


The purpose of quadratic programming is to produce 
sample weights that i) are close to initial weights, ii) are 
within reasonable bounds, iii) maintain specified control 
totals and iv) produce a design consistent estimator. Apart 
from the bounds on the weight, the weights from quadratic 
programming are those of a simple regression estimator. We 
first describe the mathematical form of the QP and then 
discuss the implementation. Let 


i) {$W_i$; i = 1, 2, ..., n} denote the set of final housing unit 
weights, where i denotes the ith long form sample 
household and n is the size of the long form sample, 

ii) {$W_i^{(0)}$; i = 1, 2, ..., n} denote the set of initial housing 
unit weights, 

iii) {$X_{ij}$; j = 1, 2, ..., $m_1$; i = 1, 2, ..., n} denote the observation 
on the jth person control variable for the ith 
sample household, 

iv) {$Z_{ij}$; j = 1, 2, ..., $m_2$; i = 1, 2, ..., n} denote the observation 
on the jth household control variable for the ith 
sample household, 

v) $X_j$, j = 1, 2, ..., $m_1$, denote the jth person control, 

vi) $Z_j$, j = 1, 2, ..., $m_2$, denote the jth household control. 

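The quadratic programming method seeks $W_i$, i = 1, 
2, ..., n, that minimize a quadratic objective function subject 
to linear constraints. In our application we minimize 

$$g(W) = \sum_{i=1}^{n} (W_i - W_i^{(0)})^2 [W_i^{(0)}]^{-1}, \tag{1}$$

subject to 

$$\sum_{i=1}^{n} W_i X_{ij} = X_j, \quad \text{for } j = 1, ..., m_1, \tag{2}$$

$$\sum_{i=1}^{n} W_i Z_{ij} = Z_j, \quad \text{for } j = 1, ..., m_2, \tag{3}$$

$$1 \le W_i \le K, \tag{4}$$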


where the summations are over housing units in the long 
form sample. Observe that the long form household weights 
are bounded below by one. This is on the basis that an 
element in the sample should at least “represent” itself. In 
our program, K was set equal to 48 but the bound was never 
attained. The lower bound of one was attained. The 
FORTRAN subroutine from IMSL was used to solve the 
QP. Other programs, such as LCP of SAS®/IML, are 
available. 
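
The remark that, apart from the bounds, the QP weights are those of a regression estimator can be illustrated with a small Python sketch. It assumes an objective of the form $\sum_i (W_i - W_i^{(0)})^2 / W_i^{(0)}$ and a single control total, for which the Lagrange solution is available in closed form, and it only checks the bounds $1 \le W_i \le K$ rather than enforcing them. It is not the IMSL routine used in the application; the function name and the data are hypothetical.

```python
def regression_weights_single_control(initial_weights, x, control_total):
    """Minimize sum((W - W0)^2 / W0) subject to sum(W * x) = control_total.

    With one linear constraint the solution is W_i = W0_i * (1 + lam * x_i),
    where lam = (control_total - sum(W0 * x)) / sum(W0 * x^2).
    """
    weighted_x = sum(w0 * xi for w0, xi in zip(initial_weights, x))
    weighted_x2 = sum(w0 * xi * xi for w0, xi in zip(initial_weights, x))
    lam = (control_total - weighted_x) / weighted_x2
    return [w0 * (1.0 + lam * xi) for w0, xi in zip(initial_weights, x)]


def within_bounds(weights, lower=1.0, upper=48.0):
    """Check the size restrictions; a full QP solver would enforce them instead."""
    return all(lower <= w <= upper for w in weights)


# Hypothetical weighting area: four sample households, x = persons per household.
initial_weights = [6.0, 6.0, 8.0, 8.0]
persons = [2, 3, 1, 4]
person_control = 77.0  # assumed short form person count

final_weights = regression_weights_single_control(initial_weights, persons, person_control)
print(final_weights, within_bounds(final_weights))
```

When some closed-form weight falls outside the bounds, the inequality constraints become active and an iterative quadratic programming solver is needed, which is the situation the method is designed for.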

The USCB’s current long form weighting procedure 
rakes the initially weighted long form sample counts to the 
census counts for the control categories. The weighting is 
done by subdivisions of the country called weighting areas 
and is done separately for person and household char- 
acteristics. The nominal sample rates for the long form are 
one-in-two, one-in-six, and one-in-eight. The nominal 
sampling weights are the inverses of the nominal sampling 
rates and are denoted by W,°”. A second set of weights, 
denoted by W,°”, are the realized sampling rates calculated 
for cells, where the cells are required to contain at least five 
sample households. For details on the USCB’s procedures 
see Schindler et al. (1992). 

Since we intend to compare the raking and QP methods, 
we use most of the USCB’s person and household categories 
as the $X_j$ and $Z_j$ control totals in the quadratic 
program, but some changes were instituted. For example, 
while we maintained all of the age-race-sex person cat- 
egories, we did not use a category based on the nominal 
sampling rates. 

We used the USCB’s specifications for determining 
whether a cell category would be retained as a separate 
control or would be combined with another cell and we used 
the USCB’s procedure for determining the cells to be 
combined. This capitalized on the USCB’s experience and 
minimized differences between the USCB’s set of long 
form control totals and the set used by the QP method. The 
procedure used to define W,'”’ is given in the appendix. 

Two possibilities exist for the control totals to be used in 
the construction of weights for the long form of the U.S. 
2000 Census. One possibility is to use controls from the 
2000 Census short form. That is, the independent controls to 
be maintained in long form weighting are those that are 
tabulated from the Census short form. When the Census is 
used as the control, the person control ($X_j$) categories 
include a cross classification of age and sex-race/ethnicity. 
Other characteristics, such as tenure, were used as additional 
controls. The majority of the household control categories 
($Z_j$) are defined by a cross classification of household type 
(e.g., family with children under 18) and household size 
(e.g., number of persons in the family). The $Z_j$ also include 
race/ethnicity of the householder cross-classified by tenure. 

The other possible set of controls for the 2000 Census is 
the set of estimates from the post enumeration survey, called 
the Accuracy and Coverage Evaluation (A.C.E.) survey. 
The A.C.E. survey is designed to estimate person characteristics 
only. The $X_j$ for the A.C.E. include age-sex-race/ethnicity-tenure 
controls. 

The last step in long form weighting is to round the $W_i$ to 
integers. Integer weights prevent discrepancies between sets 
of estimates caused by rounding of real valued estimates. 
Sample housing units were grouped by race/ethnicity of the 
householder and by tenure. Then within each group, the 
sample was sorted by family type by household size. The 
weights were then rounded to integers using the cumulate- 
and-round procedure. Table 1 illustrates the method. The 
partial sums of the weights are formed (cumulated) as 
shown in the column CW. The partial sums are then 
rounded as shown in the column RCW. The integer weight 
for element i is the difference between successive entries 
i—1 and i in the RCW column. 


Table 1 
Illustration of Cumulate and Round 

Sample Unit   Initial Weight   CW       RCW   Integer Weight 
1             3.333            3.333    3     3 
2             2.500            5.833    6     3 
3             1.428            7.261    7     1 
4             1.250            8.511    9     2 
5             1.111            9.622    10    1 
6             5.021            14.643   15    5 
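
The cumulate-and-round procedure of Table 1 can be sketched in a few lines of Python. The weights are those of Table 1 (the fifth weight, 1.111, is taken as the difference between the CW entries 9.622 and 8.511); rounding halves upward is an assumption, since the treatment of ties is not stated, and the function name is hypothetical.

```python
import math


def cumulate_and_round(weights):
    """Convert real-valued weights to integers by rounding their partial sums."""
    integer_weights = []
    cumulative = 0.0       # running partial sum (column CW)
    previous_rounded = 0   # previous rounded partial sum (column RCW)
    for w in weights:
        cumulative += w
        rounded = math.floor(cumulative + 0.5)  # assumed tie rule: round half up
        integer_weights.append(rounded - previous_rounded)
        previous_rounded = rounded
    return integer_weights


table_1_weights = [3.333, 2.500, 1.428, 1.250, 1.111, 5.021]
print(cumulate_and_round(table_1_weights))
```

Because each integer weight is a difference of rounded partial sums, the integer weights within a group add up to the rounded group total, which is what keeps the rounded estimates close to the real-valued ones.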


3. VARIANCE ESTIMATION 


Variances of long form estimates were estimated using 
the jackknife method. In the numerical results using census 
controls, sixteen replicates were formed. Sixteen was chosen 
for convenience and a larger number could have been used. 
The long form sample was ordered by the census iden- 
tification number within blocks and sixteen replicates were 
formed as the sixteen one-in-sixteen systematic samples. 
Sixty seven replicates were formed for the estimates using 
A.C.E. controls. 
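
The replicate construction described above can be sketched as follows. This is a simplified illustration in Python: the replicates are the one-in-r systematic groups of an ordered sample, but each replicate estimate is obtained by rescaling the remaining weights by r/(r - 1) rather than by recomputing quadratic programming weights, and the variance uses the standard delete-a-group form $(r-1) r^{-1} \sum_i (\hat{\theta}_{(i)} - \hat{\theta})^2$. No rounding adjustment is included. Function names and data are hypothetical.

```python
def systematic_groups(n, r):
    """Assign ordered units 0..n-1 to r one-in-r systematic groups."""
    return [k % r for k in range(n)]


def jackknife_variance(values, weights, r):
    """Delete-a-group jackknife variance of a weighted total.

    Simplifying assumption: after deleting a group the remaining weights are
    rescaled by r / (r - 1) instead of being recomputed by a weighting program.
    """
    groups = systematic_groups(len(values), r)
    full = sum(w * v for w, v in zip(weights, values))
    replicates = []
    for g in range(r):
        kept = sum(w * v for w, v, gk in zip(weights, values, groups) if gk != g)
        replicates.append(kept * r / (r - 1))
    variance = (r - 1) / r * sum((rep - full) ** 2 for rep in replicates)
    return variance, replicates


# Hypothetical ordered sample of 32 households, each with weight 6.
values = [float(k % 5) for k in range(32)]
weights = [6.0] * 32
variance, replicates = jackknife_variance(values, weights, r=16)
print(variance)
```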

3.1 Replicates for Census Controls 


The jackknife replicate is created by deleting the ith 
group of elements, computing the quadratic programming 
weights and rounding the weights to integers. Because of 
the rounding, the usual jackknife variance estimation 
procedure required modification. To isolate the effect of 
rounding, we consider the replicate estimate constructed 
with real-valued weights. Let 


$\hat{\theta}_I$ = the sample estimator with weights rounded to 
integers, 

$\hat{\theta}_R$ = the sample estimator with real-valued weights, 

$\hat{\theta}_{R(i)}$ = jackknife replicate estimate with the ith group 
deleted and real-valued weights, 

$\hat{\theta}_{I(i)}$ = jackknife replicate estimate with the ith group 
deleted and weights rounded to integers, 

and let 

Dae ane (5) 
i=] 


where r is the total number of replicates. Then the jackknife 
deviation for the estimator with integer weights can be 
decomposed as 


A A 


8 i) 0, = 8 avi) =i, 
a 16. = 0, af (8 rc) —9p) | (6) 


We assume that the error in the rounding operation is 
independent of the group chosen for deletion, a reasonable 
assumption, given that the deletion produces an entire new 
set of weights to be rounded. Then 


EVO uy 8)? } = EVO ~84)"] 
b ELL. - 8 xii Ve 00,; aap (7) 
Assume that the average of the Gre is equal to 0 ee 


Then the last term of (7) is a replicate deviation for the 
difference between the real and rounded estimates. Then 


A A = A 2 
F{[(6. Ske ( 0, —9p ) 2 
5 “ “ AM a(S) 

r (r afi IV {8,4 i buh f= v16, <a | 
where V { 8.4) ~§ a(i)s 1s the variance due to rounding for a 
sample of r — 1 groups and V{@,,—0,} is the variance 
due to rounding for a sample of r groups. In obtaining (8) 
we assumed the variance due to rounding for a sample of r 
groups is the variance for r—1 groups multiplied by 
r~'(r—1). Thus 
Bry 6 iy -3,) + 
i=1 


E{(r-1)2rV_ {6,3} 
eV Or i. ts FQ) 


where 


VetO enero) » (Oe =O) 
cm 


is the jackknife variance estimator for the estimator with real 
weights. Then an estimator of the variance due to rounding 


2 


(r-1)7> (6, —9,) 
=I) 
Sr( relieve} 
= » Gre -6,) 
=P i=l 


HP -DOEVAG 3 (10) 


r'(r-1) 


Based on these results, the estimated variance for the 
rounded estimator is 


V{Ow}= (r-l)1(r—2) Vp {8p} 
ey (6 yi) — 94)? (11) 
i=l 


3.2 Replicates for A.C.E. Controls 


The replicates for estimates constructed with A.C.E. 
controls were modified so that the estimated variances 
contained a component for the error in the A.C.E. estimates. 
The data in a weighting area were assigned to 67 replicates 
where 67 is the number of controls. The procedure requires 
the number of replicates to equal or exceed the number of 
controls if the covariance matrix of the estimated control 
totals is to be reproduced. More replicates than controls can 
be used. See Fuller (1998). 

The estimator of the total of a characteristic for the long 
form is a type of regression estimator using the A.C.E. 
numbers as controls. We write the estimator for the total 
based on real valued weights as 


$$\hat{\theta}_R = \hat{X}_A \hat{\beta}, \tag{12}$$
where $\hat{X}_A$ is the vector of A.C.E. estimates and $\hat{\beta}$ is the 
regression coefficient computed with the long form data. 

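Let $\hat{V}_{AA}$ be the r × r covariance matrix of the vector of 
A.C.E. controls, where $\hat{V}_{AA}$ is estimated as part of the 
A.C.E. process, and r = 67. Let $\lambda_1, \lambda_2, ..., \lambda_r$ be the roots of 
$\hat{V}_{AA}$ and let 

$$Q' \hat{V}_{AA} Q = \Lambda, \tag{13}$$

where $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, ..., \lambda_r)$, $\lambda_1 \ge \lambda_2 \ge ... \ge \lambda_r$, and Q 
is the matrix composed of the characteristic vectors of $\hat{V}_{AA}$. 
Recall that 

$$\hat{V}_{AA} = Q \Lambda Q'$$

and 

$$\hat{V}_{AA} = \sum_{j=1}^{r} \lambda_j q_{\cdot j} q_{\cdot j}' = \sum_{j=1}^{r} z_j' z_j, \tag{14}$$

where $q_{\cdot j}$ is the jth column of Q and $z_j = \lambda_j^{1/2} q_{\cdot j}'$. 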
Using result (14), controls for the r replicates were 
constructed as 

$$\hat{X}_{A(i)} = \hat{X}_A + c z_i, \quad i = 1, 2, ..., r, \tag{15}$$

where $\hat{X}_A$ is the row vector of the original controls and c is 
a constant. The constant c is determined so that the expectation 
of the sum of the jackknife squared deviations for 
the elements of the vector $\hat{X}_A$ equals the diagonal elements of 
$\hat{V}_{AA}$. In our application, the constant c is $(r-1)^{-1/2} r^{1/2}$ and 

$$(r-1) r^{-1} c^2 \sum_{j=1}^{r} z_j' z_j = \sum_{j=1}^{r} z_j' z_j = \hat{V}_{AA}. \tag{16}$$

Thus, if the characteristic being “estimated” is one of the 
controls used in the QP, the jackknife procedure returns the 
A.C.E. estimated variance for that characteristic. The $z_j$ are 
assigned at random to the r replicates. 

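Using the regression representation, we write the 
estimator for the ith replicate as 

$$\begin{aligned}
\hat{\theta}_{RA(i)} &= \hat{X}_{A(i)} \hat{\beta}_{(i)} \\
&= \hat{X}_A \hat{\beta}_{(i)} + (\hat{X}_{A(i)} - \hat{X}_A) \hat{\beta}_{(i)} \\
&= \hat{\theta}_{R(i)} + c z_i \hat{\beta}_{(i)},
\end{aligned} \tag{17}$$

where $\hat{\theta}_{RA(i)}$ is the real-valued estimator computed with the 
ith group deleted using $\hat{X}_{A(i)}$ as the control vector, $\hat{\beta}_{(i)}$ is 
the regression coefficient computed with the ith group 
deleted, and $\hat{\theta}_{R(i)}$ is the real-valued estimator computed 
with the ith group deleted using $\hat{X}_A$ as the control vector. 
Then 

$$\hat{\theta}_{RA(i)} - \hat{\theta}_R = \hat{\theta}_{R(i)} - \hat{\theta}_R + c z_i \hat{\beta}_{(i)}.$$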

Because the $q_{\cdot j}$ are assigned to replicates at random, the 
expectation of the replicate variance estimator for the real- 
valued estimator based on A.C.E. controls is 
elisball-e{ 08, mB} 


-E| ro (250% (6,.. -§,)| 
i=l 


i EAB Van B.| nee) 
. Now, assuming ERY p= Vase (Bua: and that 
V4 1s independent of B,,,, 


E{B., Vas Bn} e BV a4 B 


+ iV {Boy} Van \ 
where tr{V,,} is the trace of the matrix. It follows that 


Br (r-1) x (6 ain — 9p *} 
i=] 


-E{ (r-1) S (Oni ~ OR *} 
=) 
+B’V,,B+O(n~), (19) 


where we assume tr {V,,} =O (n") and tr[V ge A 
O(n''), where n is the sample size. The first term on the 
right of the equality in (19) is the expectation of the variance 
estimator for the variance due to the sampling of long forms 
from the census. The second term is the contribution of the 
variance of the error in the A.C.E. estimates to the total 
variance. Thus, the variance estimator based on $\hat{\theta}_{RA(i)}$ 
estimates both components of variation. Observe that the 
estimated covariance matrix for the controls is $\hat{V}_{AA}$, as it 
should be. 


4. NUMERICAL RESULTS 


We used the USCB’s 1990 Census data file to illustrate 
the application of the QP method to actual data. The file 
provides data for households and for persons in households, 
together with long form weights as developed for the 1990 
U.S. Census. Hence, the file provides data appropriate for 
comparing the performance of the USCB’s 1990 long form 
weighting method with the QP method. 

The USCB long form sample weighting is done by 
weighting area, where the weighting areas usually contain 
two to three thousand housing units. There were about 
56,000 weighting areas in the U.S. in 1990. For our 
numerical work we chose weighting area (WA) 1788 that 
contains 8,034 occupied housing units and 25,145 persons. 

In Table 2 we provide estimates of some person and 
housing unit characteristics for weighting area 1788. The 
characteristics in the table, except the number of rented 
units, were suggested by subject matter personnel at the 
USCB. In Table 2, Est.(H) is the long form sample weighted 
estimate computed with housing unit weights, Est.(P) is the 
long form sample weighted estimate computed with person 
weights. The quadratic programming estimator constructed 
with Census controls is called QP in the table, while QPG is 
used to denote the generalization of the quadratic pro- 
gramming estimator with objective function (20). The QPG 
estimator is discussed subsequently. The USCB housing 
unit estimates in Table 2 that are based on person weights 
were created by using the householder weight as the 
housing unit weight. Every occupied unit contains a single 
householder. The householder procedure is called the 
principal person method by Alexander (1987). All estimates 
in the table are given as a percent of the census count. 

Estimates constructed by the two USCB methods can 
differ by several percentage points with the differences 
between Est.(P) and Est.(H) for rented units, persons aged 0 
to 4 years, persons aged 65 and over, Hispanic, Asian, and 
persons in rented units being noticeable. The Est.(H) 
estimate for persons in rented units is closer to 100 than the 
Est.(P) estimate. 

The cell collapsing rules produced 45 person and 22 
housing unit controls for WA 1788. An example of a person 
control is the total number of Non-Hispanic Black males 
aged 65 and over, while an example of a housing unit 
control is the total number of Non-Hispanic White owned 
housing units. Total Black persons is an implicit control in 
WA 1788. Controls for total persons 18-44, total persons 
45-64, total males, total renters and total number of rented 
housing units were added to the QP. Apart from the controls 
mentioned above, none of the remaining characteristics in 
Table 2 is also used as a control in the QP procedure. 


The QP estimates and standard errors of the QP estimates 
are given, as a percent of the census counts, in the fourth 
and fifth columns of Table 2. The agreement between count 
and QP estimates for household characteristics is 
comparable to that of the USCB household based estimates and 
superior to that of the USCB person based estimates. For person 
counts, the QP estimates are generally closer to the census 
counts than either of the USCB raking estimates. 


The largest difference between a QP estimate and the 
census count relative to the standard error is for the estimate 
of the number of households with own children present, 
where the difference is about 1.6 standard errors. The 
majority of the QP estimates differ from the census count by 
less than one standard error. A number of the USCB person 
estimates deviate from the census count by more than one 
QP standard error. 

Table 2 
Estimated Occupied Housing Unit and Person Characteristics for WA 1788 
(estimates and standard errors are given as a percent of the census count) 

Characteristic          Census    Est.(H)   Est.(P)       QP       se(QP)        QPG      se(QPG) 
                        Count     /Count    /Count        /Count   /Count        /Count   /Count 

Housing unit characteristics 
With Own Children        4,349    100.18    100.45        100.21   0.13          100.18   0.14 
Not With Own Children    3,685     99.78     99.67         99.76   0.15           99.78   0.16 
With 1 to 4 Persons      6,785    100.00    100.57        100.04   0.05          100.07   0.05 
With 5+ Persons          1,249    100.00    [illegible]    99.76   0.30           99.60   0.30 
Rented Unit              2,559    100.00     95.97        100.00   0.19           99.92   0.16 
Owned Unit               5,475    100.00    102.02        100.00   0.09          100.04   0.08 

Person characteristics 
Age 0-4 years            2,493    101.92     97.95         98.84   1.68           99.96   0.29 
Age 5-17 years           6,339    103.91    101.07        100.63   0.71           99.98   0.18 
Age 18-44 years         12,711     99.50     99.69        100.01   0.05          100.00   0.06 
Age 45-64 years          3,028    101.65    101.95         99.90   0.09           99.97   0.09 
Age 65+ years              574     81.18     93.13        100.17   0.85          100.00   0.27 
Males                   12,473     99.95     99.64        100.06   0.08           99.98   0.09 
Females                 12,672    101.43    100.36         99.95   0.10          100.01   0.09 
Hispanic                 2,385     95.38    103.40         99.96   0.38           99.87   0.38 
Not Hispanic            22,760    101.25     99.64        100.03   0.07          100.00   0.10 
Black                    1,285    101.08    101.79        100.86   1.22           99.77   0.54 
White                   22,312    100.69     99.91        100.03   0.07          100.00   0.10 
Asian                      317     92.60     80.05         96.83   [illegible]    99.76   0.50 
Remainder                1,231    101.94    103.89        105.84   9.54          100.78   [illegible] 
In Rented Unit           7,978    102.04     95.41        100.01   0.24           99.92   0.19 
In Owned Unit           17,167    100.06    102.13        100.00   0.09          100.02   0.13 

Est.(H): USCB weights for households 
Est.(P): USCB weights for persons 
QP: QP weights with 82 constraints 
QPG: Generalized QP with 13 constraints and objective function (20) 


Because the number of rented units, persons aged 18-44, 
persons aged 45-64, males, and persons in rented units were 
used as controls in the QP procedure, differences between 
QP estimates and census totals for those categories are due 
to rounding. The standard errors demonstrate that the 
rounding can lead to sizeable deviations from the controls. 

The 45 person and 22 housing unit control totals obtained 
by the collapsing rules are such that a margin estimate, such 
as total males, may not be constrained to agree with the 
count. In addition, for different weighting areas, USCB’s 
collapsing procedure gives different person and housing unit 
constraints. Thus we considered adding some margin totals 
to the set of control totals. To reduce the impact of the added 
controls on the weights, we replaced the original constraints 
with additional terms in the objective function. The terms 
are deviations between the final estimates and the control 
totals. The objective function becomes 


$$G(W) = g(W) + \sum_{j=1}^{67} \alpha_j \Big( \sum_{i} W_i X_{ij} - X_j \Big)^2 \qquad (20)$$

where $g(W)$ is defined in expression (1), the
$\{X_j,\ j = 1, 2, \ldots, 67\}$ is the set of auxiliary variables
defining the 45 person and 22 housing unit controls, and the $\alpha_j$
are constants to be specified. The $X_{ij}$ for category $j$ of
household $i$ for a person characteristic is the number of
individuals in category $j$ in the housing unit. The $X_{ij}$ for a
housing unit characteristic is one if the housing unit has the
characteristic and zero otherwise. In our application, the
function is minimized subject to two household controls and 
eleven person controls. The housing unit controls are rented 
housing units and owned housing units. The person controls 
are persons 0 to 4 years, persons 5 to 17 years, persons 18 to 
44 years, persons 45-64 years, persons 65 years and over, 
males, black, white, Asian, Hispanic, and renters. The $\alpha_j$
are $10\,[\bar{W}^{(0)}]^{-1} \sigma_j^{-2}$, where $\bar{W}^{(0)} = 8.95$ is the mean of
the $W_i^{(0)}$, $\sigma_j^2 = P_j (1 - P_j)$, and $P_j$ is the proportion of
the population in cell $j$. The $\alpha_j$ would minimize the mean
square error of an estimated total if there was a single 
control variable and the squared correlation between the 
control variable and the dependent variable was about 0.9. 
Thus, the function exerts considerable pressure for the final 
estimate to be close to the control total. 
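
The penalized objective in (20) can be illustrated numerically. The following is only a minimal sketch: expression (1) for g(W) is not reproduced in this section, so the code substitutes a chi-square-type distance from a set of starting weights as a stand-in, and the names `penalized_objective`, `w0`, `totals` and `alpha` are hypothetical.

```python
import numpy as np

def penalized_objective(w, w0, X, totals, alpha):
    """Evaluate G(W) = g(W) + sum_j alpha_j * (sum_i W_i X_ij - X_j)^2.

    w      : candidate household weights, shape (n,)
    w0     : starting weights, shape (n,)
    X      : auxiliary variables X_ij, shape (n, J)
    totals : control totals X_j, shape (J,)
    alpha  : penalty constants alpha_j, shape (J,)
    """
    w = np.asarray(w, dtype=float)
    w0 = np.asarray(w0, dtype=float)
    X = np.asarray(X, dtype=float)
    # ASSUMPTION: chi-square-type distance used as a stand-in for g(W);
    # the paper's expression (1) may differ.
    g = np.sum((w - w0) ** 2 / w0)
    # Deviation of each weighted estimate from its control total.
    deviations = X.T @ w - np.asarray(totals, dtype=float)
    return float(g + np.sum(np.asarray(alpha, dtype=float) * deviations ** 2))
```

Larger values of `alpha` push the minimizing weights toward reproducing the control totals, at the cost of moving further from the starting weights.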

The QP solution to (20) gives a type of regression 
estimator. See Fuller (2002) and Fuller and Isaki (2001). 
Rao and Singh (1997) and Bardsley and Chambers (1984) 
consider related estimators. 

Using G(W) of (20) and the 13 linear constraints, the 
results in the final two columns of Table 2, under the 
heading “QPG”, were obtained. As expected, the estimates
are close to Census totals because the Census marginals 
were used as constraints. The relative percent differences 
between the QP estimate and the census count for the 67 
characteristics in G(W) of (20) ranged from -3.50% to 
3.75% with about 50 of the differences being less than one 
percent. 

The sample weights obtained by the two programming 
approaches are compared to those of the USCB’s household 
raking method in Table 3. The number and type of controls 
used under the USCB raking was not determined exactly 
because the number depends on the execution of the USCB 
collapsing procedure and on some preliminary files that are 
not readily available. However, we believe the number to be 
about 67 because the collapsing procedure used to form the 
67 cells is basically that used by the USCB. The QP 
procedure used 82 controls and the QPG procedure used 90
controls. The ranges of weights for the two QP methods are
similar, with a smaller range for raking. There are modest
differences among the three sums of squares of the weights. 
The g(W) values are also similar, with the value for (20) 
being the largest. The g(W) value is the quantity being 
minimized by the weights of the first line of the table. The 
sum of squares of the weights for the QP of (20) could be
reduced by reducing the $\alpha_j$ in the objective function.

We also used data from the 1990 Census to simulate the 
situation in which the controls come from adjusted census 
counts. For 1990, person estimates from the 1990 Post 
Enumeration Survey are available, but there are no housing 
unit estimates based on that survey. We call these estimates 
A.C.E. estimates. See Hogan (1993) and Isaki, Tsay and 
Fuller (2000). Estimates for WA 1788 were created by the 
QP method, using the A.C.E. estimates as controls. We used 
G(W) of (20) as the objective function with 63 age-race-sex- 
tenure person characteristics in the second term of the 
objective function and 11 person constraints. The person 
constraints are persons 0 to 4 years, 5 to 17 years, 18 to 44 
years, 45 to 64 years, 65 and over, total males, total 
Hispanic, total Black, total White, total Asian and total 
persons in rented units. 


Table 3
Properties of Long Form Housing Unit Sample Weights in WA 1788

Method                     Minimum       Maximum       Sum of squared   g(W)
                           Weight        Weight        weights
QP with g(W) of (1),       1             26.5          78,028           326
  72 constraints
QP with G(W) of (20),      1             [illegible]   78,672           383
  13 exact constraints
Raking                     [illegible]   [illegible]   77,000           369


Table 4 contains the estimates for WA 1788 identified as 
QPG and given as a percent of the census counts. The QPG 
estimates for these eleven person characteristics agree with 
the A.C.E. estimates, except for rounding error. The standard
errors reflect the error in the A.C.E. estimates and, hence, are
much larger than the standard deviation of rounding error.
For example, the rounding error standard deviation for
persons 18-44 is 0.06 in Table 2, while the standard error
for the A.C.E. estimate of persons 18-44 is 0.63. The QP
estimates for household characteristics seem very reasonable.
The estimated total number of households is 1.8%
larger than the census count while the A.C.E. estimated 
number of persons is 2.0% larger than the census count. The 
quadratic programming total number of persons differs 
slightly from the A.C.E. estimate because of rounding of the 
weights. The difference is about 7% of the standard error. 
Table 4
The Census Count, A.C.E. Estimates and QP Estimates with A.C.E. Controls, WA 1788

                         Census   A.C.E.        QPG      s.e.(QPG)
                         Count
Housing unit characteristics
With Own Children         4,349   -             101.89   2.09
Not With Own Children     3,685   -             101.66   3.07
With 1 to 4 Persons       6,785   -             102.03   2.03
With 5+ Persons           1,249   -             100.40   5.92
Rented Unit               2,559   -             104.57   2.62
Owned Unit                5,475   -             100.47   1.50
Total                     8,034   -             101.78   [illegible]
Person characteristics
Age 0-4 years             2,493   [illegible]   102.81   1.00
Age 5-17 years            6,339   103.09        103.08   0.96
Age 18-44 years          12,711   101.67        101.67   0.63
Age 45-64 years           3,028   100.26        100.33   0.59
Age 65+ years               574    99.48         98.95   0.70
Males                    12,473   102.18        102.01   0.68
Females                  12,672   101.74        101.82   0.62
Hispanic                  2,385   104.95        104.91   1.09
Not Hispanic             22,760   101.64        101.60   0.60
Black                     1,285   104.59        104.82   1.01
White                    22,372   101.69        101.69   0.61
Asian                       257   100.00        101.95   1.95
Remainder                 1,231   104.47        102.92   1.14
In Rented Unit            7,978   104.25        104.21   0.89
In Owned Unit            17,167   100.89        100.84   0.68
Total                    25,145   101.96        101.91   0.57


5. CONCLUSIONS


The QP method is shown to work well on actual USCB
long form data. The QP single household weight method
possesses several advantages over the USCB separate
weights method. With one set of weights, there will be no
confusion as to which weights to use for estimating a given
characteristic. Also, estimates of relationships such as ratios
of person characteristics to household characteristics are
expected to be less variable when a single set of weights is
used for both characteristics.


Given that a single set of weights is easier to compute 
and easier for analysts to use, one would only construct two 
sets of weights if the weights designed for one type of 
characteristic give estimates with smaller variance for that 
type of characteristic. This did not seem to be the case in our 
example. The single set of QP weights gave favorable 
results for both household and person characteristics when
compared with the USCB weights for the specific category. 

The QP estimation module is computationally feasible 
and can replace the raking estimation module in the USCB 
operational setting. The QP method can produce long form 
sample weights for households in an adjustment situation in 
which only person controls are available. 
APPENDIX


Procedure used to define cells and initial
weights $W_i^{(0)}$


We used the USCB’s procedure to determine the order in
which cells are combined (collapsed). The cell collapsing
rules specify that each cell contain at least 5 sample
households. The procedure below is our extension of the
USCB rules for defining $W_i^{(0)}$.

Let two cells under consideration be identified as Cell 1 
and Cell 2. 

i) Cell 1 is not to be collapsed and $n_1^{-1} N_1 \le B$, where $N_1$
is the Census count of households in Cell 1 and $n_1$ is
the long form sample count in Cell 1. The constant $B$
is provided by the sponsor and in our work, 27 is
used. For household $i$ in Cell 1, let

$$W_i^{(0)} = \max\{1.2,\ \tilde{W}_i\}, \qquad (A.1)$$

where $\tilde{W}_i = \min\{Q_1 W_i^{(00)},\ B\}$,

$$Q_1 = \Big( \sum_{i \in A_1} W_i^{(00)} \Big)^{-1} N_1,$$

and $A_1$ is the set of indices in Cell 1. The number 1.2
is an arbitrary lower bound chosen greater than one
and less than the minimum of $W_i^{(00)}$, which is two.
Note that the $W_i^{(0)}$ provides reasonable estimated
totals for Cell 1. If $n_1^{-1} N_1 > B$, collapse Cell 1 with
Cell 2 as in ii) below.

ii) Cells 1 and 2 are designated for collapse,
$(n_1 + n_2)^{-1} (N_1 + N_2) \le B$, $n_1 + n_2 \ge 5$, and $n_1^{-1} N_1 > n_2^{-1} N_2$.
Then for $i$ in Cell 1, $W_i^{(0)}$ is defined by (A.1). For $i$ in
Cell 2,

$$W_i^{(0)} = \max\{1.2,\ \tilde{W}_i\},$$

where

$$\tilde{W}_i = \min\{Q_2 W_i^{(00)},\ B\},$$

$$Q_2 = \Big( \sum_{i \in A_2} W_i^{(00)} \Big)^{-1} (N_1 + N_2 - \hat{N}_1),$$

and

$$\hat{N}_1 = \sum_{i \in A_1} W_i^{(0)}.$$

The $W_i^{(0)}$ in $A_1 \cup A_2$, the union of cells 1 and 2,
maintain the total households in $A_1 \cup A_2$ and also
provide an estimated total for Cell 1 that is reasonably
close to the true total.


iii) Cells 1 and 2 are designated for collapse, $n_1 + n_2 \ge 5$,
and $(n_1 + n_2)^{-1} (N_1 + N_2) > B$. Then it is necessary to
initiate further collapsing. The combined cell becomes
the Cell 1 of case (ii). Continue cell collapsing until
$(n_1 + n_2 + \cdots)^{-1} (N_1 + N_2 + \cdots) \le B$. Case (iii) was not
observed in the study data set.


One could repeat the weight construction procedure in an
iterative manner by using the $W_i^{(0)}$ as $W_i^{(00)}$ in a second
cycle. We tried a second cycle on the data described in the
text. There was no discernible improvement in the estimates
from using a second cycle.
Properties of the Weighting Cell Estimator Under a
Nonparametric Response Mechanism 


D. NASCIMENTO DA SILVA and JEAN D. OPSOMER¹


ABSTRACT 


The weighting cell estimator corrects for unit nonresponse by dividing the sample into homogeneous groups (cells) and 
applying a ratio correction to the respondents within each cell. Previous studies of the statistical properties of weighting 
cell estimators have assumed that these cells correspond to known population cells with homogeneous characteristics. In 
this article, we study the properties of the weighting cell estimator under a response probability model that does not require 
correct specification of homogeneous population cells. Instead, we assume that the response probabilities are a smooth but 
otherwise unspecified function of a known auxiliary variable. Under this more general model, we study the robustness of 
the weighting cell estimator against model misspecification. We show that, even when the population cells are unknown, 
the estimator is consistent with respect to the sampling design and the response model. We describe the effect of the number 
of weighting cells on the asymptotic properties of the estimator. Simulation experiments explore the finite sample properties 
of the estimator. We conclude with some guidance on how to select the size and number of cells for practical 
implementation of weighting cell estimation when those cells cannot be specified a priori. 


KEY WORDS: Finite population asymptotics; Quasi-randomization inference; Weighting cell selection. 


1. INTRODUCTION 


Item and unit nonresponse occur in almost all large-scale 
surveys, and proper estimation techniques need to account 
for it. While item nonresponse is often dealt with through 
imputation, unit nonresponse is most often accounted for 
through weighting adjustments. Cell weighting adjustments 
for nonresponse have been applied since at least the 1950s 
in survey estimation, e.g. U.S. Bureau of the Census (1963, 
page 53), and continue to be widely used in practice today, 
because they have intuitive appeal and are relatively easy to 
implement in practice. Reviews of common weighting 
procedures are given in Kalton (1983) and Kalton and 
Kasprzyk (1986). A number of authors have studied the 
properties of the weighting cell estimator under a variety of 
theoretical frameworks. Oh and Scheuren (1983) derive the 
mean and variance of the weighting cell estimator under 
simple random sampling, conditional on the sample size 
and the number of respondents in each cell. See also Kalton 
and Maligalig (1991). Särndal, Swensson and Wretman
(1992, page 578) use the term “response homogeneity 
group” for cells in which the nonresponse is assumed to be 
constant, and derive the properties of the resulting 
weighting cell estimator for general designs. The recently 
introduced fully efficient fractional imputation (FEFI) of 
Kim and Fuller (1999) can also be expressed as a weighting 
cell estimator, and these authors derive its model properties 
under the assumption that the variables are independent and 
identically distributed (iid) within each cell. 


¹ D. Nascimento Da Silva,


While the specific assumptions vary, a common thread 
among all these results is that the weighting cells are 
correctly specified, in the sense that units within each cell 
are indeed fully “exchangeable” (the precise definition of 
this term depends on the framework selected: equal 
response probabilities for randomization-based inference, 
or iid observations for model-based inference). In the 
terminology of Little and Rubin (2002, Chapter 1), this is 
the case of observations missing at random (MAR), where 
auxiliary information (i.e., cell membership in this case) can 
be used to correct the inference for the nonresponse. 

In this article, we depart from this framework. We will 
assume that the response mechanism depends on a known 
continuous auxiliary variable, but the exact functional form 
of this relationship is left almost completely unspecified 
(details on this nonparametric response mechanism are 
provided in the next section). Knowledge of such a variable 
could be used to construct more sophisticated nonresponse 
adjustments such as propensity weighting (Cassel, Särndal
and Wretman (1983), Little (1986), and Da Silva and 
Opsomer (2003)) or post-stratification, but we will instead 
limit our use of this auxiliary variable to the division of the 
population into weighting cells. Our primary goal with this 
approach is to study the robustness of the popular weighting 
cell estimator to model misspecification, and in particular, 
the effect of the number of cells. Hence, in contrast to the 
approach of the authors discussed above, the weighting 
cells are used as a practical way to construct a survey
estimator, but they will not be assumed as part of the
statistical framework. This is similar to the “adjustment by
subclassification” idea proposed by Cochran (1968) for 
removing the bias due to a continuous covariate in 
observational studies. 

We will study the properties of the estimator under 
quasi-randomization, a term used by Oh and Scheuren 
(1983) to denote joint inference under the sampling design 
and the response mechanism. The asymptotic properties of 
the estimator will be established by embedding the finite 
population and the corresponding sampling design and 
response mechanism in a sequence of such populations and 
random mechanisms, as will be explained in later sections. 
This asymptotic framework is very similar to that advocated 
by Hansen, Madow and Tepping (1983) and used in Isaki 
and Fuller (1982), among others. 

The remainder of this paper is organized as follows. In section 2,
we introduce the notation and framework for the sampling 
design and the nonresponse model, and discuss the 
weighting cell estimator. In the following section, we derive 
the asymptotic design properties of the estimator. In section 
4, we report on a simulation study to examine the practical 
behavior of the estimator, compare its practical behavior 
with that predicted by the asymptotic theory, and provide 
some guidance on the choice of the weighting cells. 


2. THE WEIGHTING CELL ESTIMATOR 


Before describing the weighting cell estimator, we 
introduce our survey design framework and the response 
generating mechanism. We consider a population
$U = \{1, 2, \ldots, N\}$, where $N$ is finite and known. For every
element $i$ in $U$, let $\mathbf{Y}_i = (Y_{1i}, Y_{2i}, \ldots, Y_{pi})$ be the associated
vector of values of $p$ characteristics of interest,
$Y_1, Y_2, \ldots, Y_p$. Likewise, let $\mathbf{X}_i = (X_{1i}, X_{2i}, \ldots, X_{qi})$ be the
vector of values of $q$ auxiliary variables, $X_1, X_2, \ldots, X_q$,
corresponding to the $i$th unit, $i \in U$. We assume that $\mathbf{X}_i$ is
known $\forall i \in U$. If $p = 1$, we denote $\mathbf{Y}_i$ by $Y_i$ and, for
$q = 1$, $X_i$ is used to denote $\mathbf{X}_i$. Let $s$ represent a sample
drawn from $U$ according to some sampling design $p(\cdot)$. This
sampling design $p(\cdot)$ is chosen by the survey sampler and
may be based on information available in the $\mathbf{X}_i$, $i \in U$.
The goal of the sample survey is to estimate unknown 
population quantities such as the population mean or total, 
or a function of these quantities. To simplify the presentation,
we will focus on the estimation of the population
total of the $Y_i$,

$$t_y = \sum_{U} Y_i.$$

When there is no nonresponse, this quantity will be
estimated by a sample-based estimator of the form

$$\hat{t}_y = \sum_{s} w_i Y_i = \sum_{U} w_i Y_i I_i, \qquad (1)$$

where the $w_i$, $i \in s$, are the sampling weights and $I_i$ is an
indicator for whether the $i$th unit is in the sample or not. In
this article, we will assume that the sampling weights are
the inverse of the inclusion probabilities, or $w_i = \pi_i^{-1}$ with
$\pi_i = \Pr(i \in s)$, so that the estimator (1) is the classical
Horvitz-Thompson estimator (Horvitz and Thompson
1952). We also let $\mathbf{I} = (I_1, I_2, \ldots, I_N)'$ represent the vector of
inclusion indicators for the population.
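
As an illustration of estimator (1), the following minimal sketch computes the Horvitz-Thompson total from inclusion probabilities and sample membership indicators. The function and argument names are hypothetical and are not taken from the paper.

```python
import numpy as np

def horvitz_thompson_total(y, pi, in_sample):
    """Horvitz-Thompson estimate of the population total of y.

    y         : values Y_i for all population units (only sampled ones are used)
    pi        : inclusion probabilities pi_i = Pr(i in s)
    in_sample : boolean inclusion indicators I_i
    """
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    in_sample = np.asarray(in_sample, dtype=bool)
    w = 1.0 / pi  # sampling weights w_i = 1 / pi_i
    return float(np.sum(w[in_sample] * y[in_sample]))
```

For example, with four units, equal inclusion probabilities of 0.5, and units 1 and 3 sampled with values 1 and 3, the estimate is 2 × (1 + 3) = 8.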

In the context of nonresponse, it is convenient to assume 
that each unit in the population is either a respondent or a 
nonrespondent for the variable of interest $Y$. Consider the
vector $\mathbf{R} = (R_1, R_2, \ldots, R_N)'$, where $R_i$ indicates if the $i$th
unit is a respondent or not. The distribution of $\mathbf{R}$ is called
the response mechanism. In analogy to the definition of the
sample $s$, we use $r \subset U$ to denote the (realized) set of
respondents in the population, i.e., those elements for which
$R_i = 1$. Since the distribution of $r$ and $\mathbf{R}$ is typically
unknown and can in principle depend on the realized value
of $\mathbf{I}$ as well as on the $\mathbf{Y}$, we need to assume a model for the
response mechanism. When this assumed model is used to 
develop an estimator for a population quantity, the 
properties of this estimator become dependent on the 
response model. Hence, a misspecified model for R has the 
potential to cause significant and difficult to measure bias 
in both the estimator and its associated measures of 
precision. To avoid this problem, we will keep the response 
mechanism quite general in this article. Specifically, we 
will assume that the R, are independent Bernoulli variables 
with 


$$\Pr\{R_i = 1 \mid \mathbf{I}, \mathbf{Y}\} = \phi_i, \quad 0 < \phi_i \le 1, \ \forall i \in U,$$


and that the $\phi_i$ can be written as $\phi_i = \phi(X_i)$, with $\phi(\cdot)$ a
continuous and differentiable but otherwise unspecified
function of the $X_i$. Note that this includes the uniform
response mechanism, where $\phi_i = \phi$ for all $i \in U$, as a
special case.
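
The response mechanism described above can be simulated in a few lines. This is a sketch under stated assumptions: the paper leaves the response function unspecified, so the logistic form and its parameters `a` and `b` below are purely hypothetical choices of a smooth function, and the function names are illustrative.

```python
import numpy as np

def response_probability(x, a=0.0, b=1.0):
    """HYPOTHETICAL smooth response function: logistic in the auxiliary variable."""
    return 1.0 / (1.0 + np.exp(-(a + b * np.asarray(x, dtype=float))))

def simulate_response(x, rng, phi=response_probability):
    """Draw independent Bernoulli response indicators R_i with Pr(R_i = 1) = phi(X_i)."""
    p = np.asarray(phi(x), dtype=float)
    return rng.random(p.shape) < p
```

Passing a constant function for `phi` reproduces the uniform response mechanism as a special case.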

When some of the selected elements do not respond, the 
estimator (1) can no longer be computed, and an estimator 
that includes a nonresponse adjustment is required. In this 
article, we are using the weighting cell estimator for this 
purpose. For simplicity, we will describe the situation in 
which both the $Y_i$ and $X_i$ are univariate variables, but the
approach can be generalized to the multi-dimensional case.
Let $s_r = s \cap r$ represent the subset of the selected elements
that actually respond to the survey.

Let $U_g$, $g = 1, \ldots, G$, represent $G$ groups obtained by
dividing the population into groups based on the values of
the known auxiliary variable $X$. Specific implementations
might generate groups of equal size, or divide the range of
$X$ into equal-length intervals. We shall leave the
implementation unspecified for now, and state some general
assumptions about $G$ and the size of the groups in the next
section. Note that we are considering the groups as fixed
with respect to the sampling design and the response
mechanism, which excludes the situation in which groups
are formed based on the observed sample values
$\{X_i : i \in s\}$. This was done primarily to simplify the
theoretical derivations, and is similar to the approach of
Särndal et al. (1992) and Kim and Fuller (1999), among
others.

Let $s_g = s \cap U_g$ be the portion of the sample that falls in
group $g$, and define similarly $s_{rg} = s_r \cap U_g$. The weighting
cell estimator is defined as

$$\hat{t}_{wc} = \sum_{g=1}^{G} \frac{\sum_{s_g} w_i}{\sum_{s_{rg}} w_i} \sum_{s_{rg}} w_i Y_i. \qquad (2)$$

From this expression, it is easy to see that in each group, the
estimator of the group total is ratio-adjusted by the inverse
of the weighted proportion of respondents in the cell. This
estimator is also the FEFI estimator of Kim and Fuller
(1999). The properties of this estimator will be studied in
the next section.
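
The ratio adjustment in (2) can be sketched as follows. This is an illustrative implementation only; the function and argument names are hypothetical, and the code raises an error for a cell with sampled units but no respondents rather than applying any modification of the denominator.

```python
import numpy as np

def weighting_cell_total(y, w, cell, sampled, responded):
    """Weighting cell estimate of a population total.

    y         : values Y_i (only respondents' values are used)
    w         : sampling weights w_i
    cell      : cell label g for each unit
    sampled   : boolean sample membership indicators
    responded : boolean response indicators
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    cell = np.asarray(cell)
    sampled = np.asarray(sampled, dtype=bool)
    resp = sampled & np.asarray(responded, dtype=bool)
    total = 0.0
    for g in np.unique(cell[sampled]):
        in_g = cell == g
        w_sample = w[in_g & sampled].sum()   # sum of weights over s_g
        w_resp = w[in_g & resp].sum()        # sum of weights over s_rg
        if w_resp == 0:
            raise ValueError(f"cell {g!r} has sampled units but no respondents")
        # Respondent total in the cell, inflated by the inverse of the
        # weighted proportion of respondents.
        total += (w_sample / w_resp) * np.sum(w[in_g & resp] * y[in_g & resp])
    return float(total)
```

With full response, every cell ratio equals one and the function returns the expansion estimator of form (1).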


3. PROPERTIES UNDER
QUASI-RANDOMIZATION


3.1 Asymptotic Framework and Assumptions 


The quasi-randomization properties of the weighting cell 
estimator will be studied in the usual finite population 
asymptotic context, in which the population $U$ is treated as
an element in an increasing sequence $U_1, U_2, \ldots, U_\nu$ with
$\nu \to \infty$, with a corresponding sequence of sampling designs
$p_\nu(\cdot)$ (see Isaki and Fuller (1982) for an early example of
this framework). Let $N_\nu$ be the size of the $\nu$th population
with $N_\nu > N_{\nu-1}$, let $\mathbf{Y}_\nu = (Y_1, Y_2, \ldots, Y_{N_\nu})'$ denote the set of
values of the characteristic of interest, $Y$, associated with
$U_\nu$, and similarly, $\mathbf{X}_\nu = (X_1, X_2, \ldots, X_{N_\nu})'$. We assume that
$\mathbf{X}_\nu$ is known. For each $\nu$, a sample of size $n_\nu$ ($n_\nu \ge n_{\nu-1}$) is
selected from $U_\nu$ according to a sampling design $p_\nu(\cdot)$. As
before, let $\mathbf{I}_\nu = (I_1, I_2, \ldots, I_{N_\nu})'$ be the corresponding sample
inclusion vector. We will denote the $K$th order central


moment of the sample membership indicators $I_{i_1}, \ldots, I_{i_K}$ by

$$\Delta_{i_1, \ldots, i_K} = E_p\Big[ \prod_{k=1}^{K} (I_{i_k} - \pi_{i_k}) \Big]. \qquad (3)$$

It is assumed that $U_\nu$ can be divided into $G_\nu$ ($G_\nu \ge G_{\nu-1}$)
mutually exclusive and exhaustive groups, $U_{\nu g}$,
$g = 1, \ldots, G_\nu$. These groups are constructed by sorting the
population according to their $X$ values and dividing the
population into $G_\nu$ groups. We will assume that there are at
least $G_\nu$ distinct values among the elements of $\mathbf{X}_\nu$. Let $N_{\nu g}$
represent the number of elements in $U_{\nu g}$.
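
One way to form such groups is to sort the population on X and cut the sorted list into consecutive blocks. The sketch below assumes groups of (nearly) equal size, which is only one of the possible implementations; the function name is hypothetical.

```python
import numpy as np

def equal_size_cells(x, n_cells):
    """Assign each population unit to one of n_cells groups by sorting on x.

    Returns an integer array of cell labels 0, ..., n_cells - 1, where cell 0
    holds the smallest x values. Group sizes differ by at most one unit.
    """
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    cell = np.empty(x.size, dtype=int)
    # np.array_split cuts the sorted index list into consecutive blocks.
    for g, idx in enumerate(np.array_split(order, n_cells)):
        cell[idx] = g
    return cell
```

Note that with tied values of X this sketch may place equal values in adjacent cells; a production implementation would need a rule for ties.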

As mentioned in the previous section, we are treating the 
groups as fixed with respect to the population. The problem 
created by this approach is that in general, there is a 
non-zero chance of obtaining a group without any 
respondents. We solve this problem by adding a small 
constant in the denominators in each of the groups, or 


G 
bias = ‘ sia a 


»P wil. (4 
g=1 | max >» w,N,G,n,- ‘a 
r,g 


LES, 


Ax 


Hence, the difference between the estimator in (4) and the estimator in (2) is
asymptotically negligible. This is similar to what is often 
done in practice to avoid overly large weights in ratio 
estimation. 

Fuller and Kim (2003) give the limiting distribution of 
the FEFI estimator under the assumption that the response 
probabilities are constant within these cells. We will study 
the case where the response probabilities are a smooth 
function of an auxiliary variable and the number of cells is
allowed to vary. Let $\mathbf{R}_\nu = (R_1, R_2, \ldots, R_{N_\nu})'$ be the response
indicator vector for the $\nu$th population. We assume that the
distribution of $\mathbf{R}_\nu$ satisfies the nonparametric response
mechanism assumptions, specified as follows:


(R1) $R_1, R_2, \ldots, R_{N_\nu}$ are independent random variables,

(R2) $\Pr\{R_i = 1 \mid \mathbf{I}_\nu, \mathbf{Y}_\nu\} = \phi_i,\ \forall i \in U_\nu$,

(R3) $\phi_i = \phi(X_i)\ \forall i \in U_\nu$, where $\phi(\cdot)$ is differentiable with
bounded first derivative, and the $X_i \in [x_m, x_M]$, with
$x_m, x_M$ fixed constants and $x_m < x_M$.


The remaining assumptions are technical conditions that 
will be used extensively in the proofs. We assume that there 


are positive constants A,, A,, ..., A, such that: 


(Al) A,<N,n,' 1,<4,<~, VieU,, and 
nN," + 7E(0, 1), as v > &; 


UAZerordistrcet.. =<, t= U-, be hs plemen ss 


(TE (W-k +b) nk 2 if K is even 


3? 


A; sesate | SB 
oe (TE vk + Dy) PP2,, if K is odd 
(A3) lim, Lieu, 9 = 9p V8 = 1.2,0G, and v 2 1; 


g 
(A4) max;<y |Y; fone 


(A5) Ag <min;-y eas et 
(AB) RpOSSNN, SAGl Wy B22 as


(A7) 1<5G,<njdo, withO<y< 1/2. 


Assumptions (A1) — (A2) imply that, asymptotically, the 
sampling design is “well behaved,” in the sense that the 
moments of the sample membership indicators are of the 
same order of magnitude as those in simple random 
sampling without replacement. This is a common 
assumption in finite population asymptotic theory. (A1) also 
requires that the sampling fraction converges to a constant 
in the interval (0, 1). The boundedness assumption (A4) on 
the observations will significantly simplify the proofs for 
some of the theorems in the article, and could be relaxed to 
the existence of bounded moments if desired. Similarly, 
some technical regularity conditions are required to avoid 
degenerate response mechanisms: (A3) provides that the 
limit for the average response probability in a cell exists, 
and (A5) excludes the situation in which some units might 
have 9, = 0. Finally, assumptions (A6) and (A7) on the 
weighting cells require that all the cells grow at a similar 
rate, and that the total number of cells does not increase 
“too fast” relative to the sample size.


3.2 Main Results 


The approach we will use in the study of the properties 
of the weighting cell estimator follows that commonly used 
in the study of finite population estimators. First, we show 
the asymptotic equivalence between the non-linear 
weighting cell estimator and a “linearized” approximation. 
Next, we derive the mean squared error properties of the 
linearized estimator and consider those as the asymptotic 
properties of the weighting cell estimator or, more 
precisely, the properties of the asymptotic distribution of 
the weighting cell estimator. See, for instance, Särndal et al.
(1992, Chapter 5) for a description of this approach. 

The following theorem formally states our first results. 
The proof is in the appendix. 


Theorem 3.1. Consider the sequence of populations
$\{U_\nu : \nu \ge 1\}$. Assume that for each $\nu$, a probabilistic sample
of fixed size $n_\nu$ ($n_\nu \ge n_{\nu-1}$) is selected from $U_\nu$ according to
sampling design $p_\nu(\cdot)$, and that the response mechanism
satisfies the conditions (R1)-(R2). Finally, assume that
(A1)-(A7) hold. Then, the estimator $\hat{t}_{wc}$ is asymptotically
equivalent to a linearized random variable $\tilde{t}_{wc}$, in the
sense that

$$\frac{1}{N_\nu} \big( \hat{t}_{wc} - \tilde{t}_{wc} \big) = O_p(G_\nu n_\nu^{-1}). \qquad (5)$$

The bias and variance of $\tilde{t}_{wc} / N_\nu$ are given by


~ cy a 
E| we Ves a Ye ed Lik) i) 
NN N,, 2=1 U, 9, 
and 
Z 1 GaeaG ore 
fag) a 
N, No 8=1 g!=1 U, UP) jadi 
G, 
fe ihe nm (19) BG Yee) 
N, g=l U, 0, 
where 
= ] = 1 ~ Ne 9, 
Pon 25% Q;, A — xy; Ye : 
As U, ae U, du, Q; 
and 
iS Y.-Y )+oY 
YS oi; :) ies : VieU, ONG IN Sisal eens Gee 


Remark 1. The asymptotic equivalence between the weighting
cell estimator and its linearized approximation depends on
the number of groups $G_\nu$, with a faster
convergence rate achieved when $G_\nu$ grows more slowly.
The intuition behind this result is that the goodness of the
linear approximation depends on how well the true cell
response adjustments are estimated by the sample-based
ratios of weight sums within each cell. Since the cell ratios will be better
estimators as the cell size grows larger, this would argue
that $G_\nu$ should be chosen to be small, which corresponds
to the current practice in applications of weighting cell
estimation. However, as will be shown below, the asymptotic MSE
properties of the estimator under the nonparametric response
mechanism improve as $G_\nu$ gets larger. A more detailed
discussion of the selection of the number of groups will be
provided after Theorem 3.2 below and in section 4.


Remark 2. The results in Theorem 3.1 depend on the
population groups $U_{\nu g}$, $g = 1, \ldots, G_\nu$, and on the $\phi_i$, $i \in U_\nu$,
but do not rely on the fact that the response probabilities are
a smooth function of the auxiliary variable $X$. Hence, the
explicit expressions for the asymptotic bias and variance
can be used to derive results for other response mechanisms
that follow (R1)-(R2). In particular, results for the
response homogeneity group model (see Särndal et al.
1992, page 577) follow directly from Theorem 3.1. This is
also the model studied by Fuller and Kim (2003). Under
that model, one assumes that $\phi_i = \phi_g$ for all
$i \in U_{\nu g}$, $g = 1, \ldots, G_\nu$, and it can easily be shown that the bias
of the linearized estimator is 0 and its variance is


Vac| “© | ~Var-| 
Vv NN. 
c 
v 1 - “i ae 
pili Pe SB n, (Y, pe ae 
N? ae P, U, 


The first term in the variance is the variance of the 
estimator without nonresponse, and the second term 
represents the variance inflation caused by the nonresponse 
under a homogeneous within-cell response mechanism. 

The following corollary follows directly from Theorem 
3.1 and Fuller (1996, Theorem 5.2.1). A proof is given in 
the appendix. 


Corollary 3.1. Under the conditions of Theorem 3.1 with
$\gamma < 1/2$ in (A7), for any sampling design $p_\nu(\cdot)$ such that


~ 


— je Yq By 
Ne 


12 
n,, 


L 
PHO WV Ds 


where $B_\nu$ is the bias of the linearized estimator, scaled by $N_\nu^{-1}$, given in
Theorem 3.1 and


V = lim _n, Var (ty, /N,) € (0, ©), 


Vier 


then 


~ NCO 1) 
N wens 


Vv 


-1/2 ; 
Wey aR 
v Vv 


Corollary 3.1 states that, whenever the linearized
estimator achieves asymptotic normality, then so does
the weighting cell estimator. Since the linearized estimator can be
written as a classical expansion
estimator of the form (1), this result is quite general.

Under the nonparametric response mechanism described 
in (R1) — (R3), it is possible to describe the effect of the 
number of groups $G_\nu$ on the asymptotic bias and variance
of the estimator. The next theorem gives the asymptotic rates for the
bias and variance, and is proven in the appendix.


Theorem 3.2. Assume that (R3) and the conditions of
Theorem 3.1 hold. Then,


and 
Remark 3. Theorem 3.2 shows that both the asymptotic
bias and variance of the weighting cell estimator
become smaller as the number of groups $G_\nu$ increases. An
intuitive explanation of that fact is that the approximation
of the function $\phi_i = \phi(X_i)$ by the step function $\phi_i = \bar{\phi}_g$, $i \in U_{\nu g}$,
improves as the number of cells increases. The asymptotic
variance has a term that is independent of $G_\nu$. This
“residual variance” is due to the inherent variability of the
sampling design and the response mechanism, and cannot
be reduced by changing $G_\nu$.
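
The step-function intuition in Remark 3 can be checked numerically. The sketch below measures the worst-case gap between a smooth response function and its within-cell average when the population is cut into equal-size cells on sorted X. The linear response function used in the usage note is an arbitrary illustrative choice, not one taken from the paper, and the function name is hypothetical.

```python
import numpy as np

def step_approximation_error(phi, x, n_cells):
    """Largest |phi(X_i) - cell mean of phi| over all units, for n_cells
    equal-size cells formed on the sorted values of x."""
    x = np.sort(np.asarray(x, dtype=float))
    p = np.asarray(phi(x), dtype=float)
    worst = 0.0
    for chunk in np.array_split(p, n_cells):
        worst = max(worst, float(np.max(np.abs(chunk - chunk.mean()))))
    return worst
```

For instance, with `phi = lambda x: 0.2 + 0.6 * x` on 1,200 equally spaced points in [0, 1], the error roughly halves each time the number of cells doubles.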


Remark 4. As noted in Remark 1, constructing a good
linear approximation requires $G_\nu$ to be small, while
Theorem 3.2 states that the MSE of the weighting cell estimator is minimized by
taking $G_\nu$ as large as possible. Taken together, this can be
interpreted to mean that, once the sample size in every cell
is sufficiently large to obtain a “valid” ratio estimator for
the average cell response probability $\bar{\phi}_g$, it is preferable to
increase the number of cells than to increase the sample size
per cell. The simulation experiments discussed in section 4
will further explore this recommendation.

The following corollary follows directly from Corollary 
3.1, Theorem 3.2, and Chebyshev’s inequality, and 
establishes the consistency of the weighting cell estimator 
under the nonparametric response mechanism. 


Corollary 3.2. Under the conditions of Theorem 3.2, $\hat{t}_{wc}$
is a consistent estimator for $t_y$ in the sense that for any
$\epsilon > 0$,


Remark 5. As Corollary 3.2 shows, as long as a variable X
can be found that is sufficiently related to the nonresponse,
in the sense of assumptions (R1)-(R3), construction of
weighting cells does not require knowledge of homogeneous
response probability cells in order to construct a
consistent estimator. However, as discussed in Remarks 1
and 4, the choice of the number of cells still has an effect on
the properties of the estimator.


Remark 6. Assumption (R3) can easily be relaxed to allow
for a small number of points of discontinuity in both $\phi(\cdot)$
and its first derivative. A “small” number can mean that the
number is either fixed as $\nu \to \infty$ or increases at a rate slower
than $G_\nu$. This would make it possible to account for
situations such as stratified designs or the presence of
domains within $U_\nu$. The present theory can be extended




directly to these situations, if the values for the variable X 
fall in non-overlapping segments for the different strata or 
domains. 


4. SIMULATION EXPERIMENTS 


4.1 Description of the Experiment 


In order to investigate the practical implications of the 
results of section 3, we carried out a Monte Carlo 
experiment on a fixed population of N = 3,000 units. We 
consider the case of one covariate, X, whose population 
values are generated as: 


$$X_1, X_2, \ldots, X_N \sim \text{iid } U(0, 1),$$

and two different variables of interest, $Y_1$ and $Y_2$. We are
interested in evaluating the effects of (1) the (model)
relationship between Y and X, (2) the response mechanism
$\phi(X)$, (3) the sample size n and (4) the number of cells G, on
the bias and on the mean square error of the $\hat{t}_{wc}$ estimator.
Since our theoretical results rely on the approximation of
$\hat{t}_{wc}$ (or its adjusted version) by a linearized estimator $\tilde{t}_{wc}$, we will also
compare the behavior of $\hat{t}_{wc}/N_\nu$ and $\tilde{t}_{wc}/N_\nu$ as estimators
of the population mean, $\bar{Y}_\nu = N_\nu^{-1} \sum_{U_\nu} Y_i$. Finally, we


compare $\hat{t}_{wc}/N_\nu$ to the “naive” estimator of the mean,
which is defined for the variable Y as:

$$\bar{y}_r = \frac{\sum_{i \in s_r} \pi_i^{-1} Y_i}{\sum_{i \in s_r} \pi_i^{-1}},$$


corresponding to a ratio adjustment of the respondent
sample to the original sample. This estimator is appropriate
under the assumption of a uniform response mechanism or,
to use the terminology of Little and Rubin (2002, chapter
1), when observations are missing completely at random
(MCAR). Note that $\bar{y}_r$ is equivalent to the weighting cell
estimator with a single cell.

The levels of the four factors used in the experiment are
given in Table 1. The “levels” of the variable Y correspond
to two populations of independent values. The variable $Y_1$
was generated as N(40, 58), truncated to -3 to +3 standard
deviations, corresponding to the “white noise” case. The
variable $Y_2$ is related to X and was generated through the
linear model $Y_2 = 27.12 + 26.06X + \epsilon$, where $\epsilon \sim N(0, 9)$.
The population mean and variance for the two variables
were, respectively, (39.9, 55.3) for $Y_1$, and (40.0, 63.9) for
$Y_2$.

The four levels of the response mechanisms contain two
different scenarios regarding the response probabilities:
constant (C1, C2), and linearly related to X (L1, L2). The
response probabilities are:

- $\phi_{C1}(X) = 0.5$
- $\phi_{C2}(X) = 0.8$
- $\phi_{L1}(X) = 0.20 + 0.60X$
- $\phi_{L2}(X) = 0.65 + 0.30X$


The levels of the linear response mechanisms were chosen
so that the average probabilities (over X) were approximately
equal to 0.5 and 0.8, respectively.
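
As a quick check of these averages, and using only the fact that X is uniform on (0, 1) so that $E(X) = 0.5$, the two linear mechanisms give

$$E[0.20 + 0.60X] = 0.20 + 0.60 \times 0.5 = 0.50, \qquad E[0.65 + 0.30X] = 0.65 + 0.30 \times 0.5 = 0.80.$$

The averages over the realized finite population of X values will differ slightly from these expected values, which is consistent with the word "approximately" above.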


Table 1
Overview of Factors in the Simulation Experiment

Factor                               Levels
Y variable                           $Y_1$, $Y_2$
Response mechanism $\phi(\cdot)$     C1, C2, L1, L2
Sample size n                        200, 500
Number of cells G                    2, 3, 5, 8


For a given G, the groups were created by dividing the
range of X into G equal segments and assigning the element
i to the group g if the value $X_i$ was in the g-th segment,
i = 1, 2, ..., N and g = 1, 2, ..., G. The simulations were
carried out through a completely randomized 2 x 4 x 2 x 4
factorial experiment. For each combination of the levels
of the factors in Table 1, B = 5,000 independent realizations
of the vector of response indicators, $R = (R_1, R_2, \ldots, R_N)$,
were generated according to the corresponding response
mechanism. For each such realization, a simple
random sample s (without replacement and of size n) was
selected from the overall population. Within each selected
sample, the respondents were the units $i \in s$ such that
$R_i = 1$.
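
The procedure above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the reported results: it assumes simple random sampling, writes the weighting cell mean as the sum over cells of the cell's share of the sample times the respondent mean in that cell, uses far fewer replicates than the 5,000 used in the experiment, and simply drops replicates with an empty respondent cell instead of redrawing them. The function names (`make_cells`, `weighting_cell_mean`, `simulate`) are hypothetical.

```python
import numpy as np

def make_cells(x, G):
    """Label each unit 0..G-1 by the equal-width segment of the range of x it falls in."""
    edges = np.linspace(x.min(), x.max(), G + 1)
    # Use interior edges only, so that max(x) lands in the last cell (label G-1).
    return np.digitize(x, edges[1:-1])

def weighting_cell_mean(y, cell, resp):
    """Sum over sampled cells of (cell share of the sample) x (respondent mean in the cell)."""
    est = 0.0
    for g in np.unique(cell):
        in_g = cell == g
        r_g = in_g & resp
        if not r_g.any():
            return np.nan  # a sampled cell without respondents: estimator undefined
        est += in_g.mean() * y[r_g].mean()
    return est

def simulate(n=200, G=8, B=300, seed=1):
    """Monte Carlo bias of the weighting cell and naive means (linear Y, linear response)."""
    rng = np.random.default_rng(seed)
    N = 3000
    x = rng.uniform(0.0, 1.0, N)
    y = 27.12 + 26.06 * x + rng.normal(0.0, 3.0, N)  # error variance 9
    phi = 0.20 + 0.60 * x                            # response probability, linear in X
    cell = make_cells(x, G)
    wc, naive = [], []
    for _ in range(B):
        resp = rng.random(N) < phi                   # response indicators R_i
        s = rng.choice(N, size=n, replace=False)     # simple random sample without replacement
        wc.append(weighting_cell_mean(y[s], cell[s], resp[s]))
        naive.append(y[s][resp[s]].mean())
    # Replicates with an empty respondent cell are NaN and are ignored here.
    return np.nanmean(wc) - y.mean(), np.mean(naive) - y.mean()

bias_wc, bias_naive = simulate()
print(f"bias (weighting cells): {bias_wc:.3f}, bias (naive): {bias_naive:.3f}")
```

In this setting the naive respondent mean is pulled upward because units with large X respond more often and also have larger Y, while the within-cell adjustment removes most of that effect.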

This procedure could in principle lead to a group not
containing any sampled and responding element, in which
case the weighting cell estimator (ignoring the adjustment
in (4)) cannot be computed. If that happened, the realization
was discarded and a new sample drawn from the population.
Out of the 5,000 repetitions for each combination of
factors, this happened 13 times in the factor combination
($Y_1$, $\phi_{L1}$, 200, 8) and 15 times with ($Y_2$, $\phi_{L1}$, 200, 8). It did
not occur with any of the other factor combinations. Hence,
the number of samples discarded was very small and this
has a negligible effect on the simulation results.

With n = 200 and G = 8, we expect approximately 25 
sampled elements in each cell, to be further reduced by the 
nonresponse. Since the estimator relies on ratio estimation 
in each cell, we judged this to be a reasonable lower bound 
on the number of observations per cell to consider in the 
simulations. In practice, a number of procedures could be 
used when groups have too few elements, such as picking 
a smaller value for G or collapsing neighboring groups. We
also implemented an estimator that collapses the empty cell
with a neighboring cell as well as a version with a lower
bound on the value of the denominator in the weighting
adjustment (i.e., the adjustment in (4)), and the results are virtually
indistinguishable from those reported below, so they will
not be further discussed here.
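
One possible way to collapse cells is sketched below. The paper does not describe its collapsing rule, so the rule used here (merge each cell without respondents into the nearest cell that has respondents, with ties going to the lower-numbered neighbor) and the function name `collapse_empty_cells` are assumptions made purely for illustration.

```python
import numpy as np

def collapse_empty_cells(cell, responded, G):
    """Relabel cells 0..G-1 so that every remaining label contains at least one respondent."""
    cell = np.asarray(cell)
    responded = np.asarray(responded, dtype=bool)
    has_resp = np.array([np.any(responded[cell == g]) for g in range(G)])
    if not has_resp.any():
        raise ValueError("no respondents in the sample")
    good = np.flatnonzero(has_resp)  # labels of cells with respondents, ascending
    # Map every label to the nearest label with respondents; argmin returns the
    # first minimum, so ties are resolved toward the lower-numbered neighbor.
    target = np.array([good[np.argmin(np.abs(good - g))] for g in range(G)])
    return target[cell]

new_cell = collapse_empty_cells([0, 0, 1, 2, 2], [True, False, False, True, True], G=3)
print(new_cell)
```

Cells that already contain respondents keep their labels, so the weighting cell estimator can then be computed on the relabelled cells without any empty denominators.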


4.2 Results 


Tables 2 and 3 show the simulated bias of the weighting
cell estimator for the variables $Y_1$ and $Y_2$ as a fraction of
the standard deviation. As a comparison, the last column of
Tables 2 and 3 displays the bias of the naive estimator, $\bar{y}_r$.
The bias as a fraction of the standard deviation, referred to
here as the relative bias,

$$\mathrm{RB}(\hat{t}_{wc}, t_y) = \frac{E(\hat{t}_{wc} - t_y)}{\sqrt{\mathrm{Var}(\hat{t}_{wc})}},$$

was also used in Cochran (1977, page 14), where it is
shown that as the relative bias increases, inferential results
rapidly become unreliable. In a simple simulation example,
Cochran (1977) shows that a relative bias of ±0.50 or more
leads to highly inaccurate 95% confidence intervals.
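
The effect of relative bias on interval coverage can be illustrated with a short calculation. The sketch below is not taken from Cochran (1977); it assumes that the estimator is normally distributed with a known standard deviation and a bias equal to the relative bias times that standard deviation, and it computes the probability that a nominal 95% interval (estimate plus or minus 1.96 standard deviations) covers the true value. The function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function, written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ci_coverage(relative_bias, z=1.96):
    """Coverage of estimate +/- z*sd when (estimate - truth)/sd ~ N(relative_bias, 1)."""
    return normal_cdf(z - relative_bias) - normal_cdf(-z - relative_bias)

for rb in (0.0, 0.5, 1.0, 2.0):
    print(f"relative bias {rb:.1f}: coverage {ci_coverage(rb):.3f}")
```

Under these assumptions the coverage falls below the nominal level as soon as the relative bias is nonzero, and the shortfall grows quickly with the size of the relative bias.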

For $Y_1$ (Table 2), the relative bias of the weighting cell
estimator is small and is similar to the relative bias of the
naive estimator, for all sample sizes, response mechanisms
and numbers of cells considered. For the variable $Y_2$ (Table 3),
similar results hold when the response mechanism is
uniform (C1, C2). However, when the response probabilities
are a linear function of X (L1, L2), the naive estimator
becomes severely biased. The relative bias of the weighting
cell estimator decreases as the
number of cells increases, and three to five cells appear
sufficient to remove most of the bias. This finding agrees
with that of Cochran (1968) in the context of bias reduction
for observational studies.


Table 2
Relative Bias of the Weighting Cell and Naive Estimators
for the Mean of $Y_1$

Sample   Response          Number of Cells           Naive
size     mechanism      2      3      5      8     estimator
200      C1          -0.00  -0.01   0.01   0.01     -0.00
         C2           0.01  -0.00  -0.01   0.00      0.00
         L1          -0.02   0.03  -0.04  -0.01     -0.00
         L2          -0.00  -0.02   0.00  -0.02     -0.00
500      C1          -0.00  -0.01   0.04  -0.01      0.00
         C2           0.01   0.02  -0.01  -0.01      0.00
         L1           0.05   0.02  -0.01  -0.02      0.01
         L2           0.01   0.01  -0.00  -0.01      0.01


Table 3
Relative Bias of the Weighting Cell and Naive Estimators
for the Mean of $Y_2$

Sample   Response                   Number of Cells                            Naive
size     mechanism       2             3             5             8           estimator
200      C1            0.01         -0.01         -0.02          0.02          -0.01
         C2           -0.03         -0.00          0.02          0.01          -0.00
         L1         [illegible]  [illegible]   [illegible]   [illegible]    [illegible]
         L2            0.30          0.18          0.06       [illegible]       1.36
500      C1            0.01          0.01         -0.02         -0.00           0.00
         C2            0.02         -0.00         -0.00         -0.01          -0.01
         L1         [illegible]      0.96          0.32          0.05           5.84
         L2         [illegible]      0.29          0.09          0.07           2.26


Hence, when the variable of interest is totally unrelated
to the response mechanism, as in the cases of $Y_1$ under all
mechanisms considered and of $Y_2$ under the uniform
response mechanism, the bias does not depend on the
number of cells. When the variable of interest and the
response mechanism are related, multiple cells are required
to remove the bias.

The relative mean squared error (RMSE) for the two
variables of interest, defined as the MSE of the weighting
cell estimator divided by the MSE of the estimator with no
nonresponse,

$$\mathrm{RMSE}(\hat{t}_{wc}, t_y) = \frac{E(\hat{t}_{wc} - t_y)^2}{E(\hat{t}_y - t_y)^2},$$

is reported in Tables 4 and 5. In these tables, the last column again
corresponds to the relative MSE of the naive estimator.
Note that with the exception of the two L1 cases for
variable $Y_2$, Tables 4 and 5 are really variance tables,
since the bias is so small.

For $Y_1$ (Table 4), the variable uncorrelated with X, the
number of cells has relatively little effect on the relative
mean square error, with results around 2.3 for a 50%
response rate, and around 1.3 for the 80% rate. However, a
relatively modest increase in MSE with the number of cells
is observed, especially
for the high nonresponse cases (C1, L1). For $Y_2$ (Table 5),
the variable correlated with X, increasing the number of
cells improves the results for all response mechanisms, but
the effect is much more pronounced when the response
mechanism is also correlated with the variable of interest.
As for the relative bias, three to five cells achieve most of
the efficiency gain, while the naive estimator is extremely
inefficient.
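
As a rough benchmark for the naive-estimator column when the response mechanism is uniform, the following back-of-the-envelope calculation may help. It rests on assumptions made here for illustration only: simple random sampling without replacement, a constant response probability $\bar{\phi}$, a negligible bias, and a respondent count fixed at its expected value $n\bar{\phi}$. Under these assumptions the respondents behave like a simple random sample of size $n\bar{\phi}$, so that

$$\mathrm{RMSE} \approx \frac{1/(n\bar{\phi}) - 1/N}{1/n - 1/N}.$$

With N = 3,000 and n = 200 this gives

$$\frac{1/100 - 1/3000}{1/200 - 1/3000} \approx 2.07 \quad (\bar{\phi} = 0.5), \qquad \frac{1/160 - 1/3000}{1/200 - 1/3000} \approx 1.27 \quad (\bar{\phi} = 0.8),$$

which is close to the values of 2.08 and 1.28 reported for the naive estimator under C1 and C2 in Table 4.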
Table 4
Relative Mean Squared Error of the Weighting Cell Estimator
Compared to the Estimator Without Nonresponse for $Y_1$

Sample   Response          Number of Cells           Naive
size     mechanism      2      3      5      8     estimator
Ci 2.02tee Sez Th. Bet 2.08 
C2 P25 aS IRE29 = d28 1.28 

ie Lil DSA SD pd Decline Oli ne DQ) 2.08 
ee. 130s) e220 ve 129% 3k 1.28 
Gl Dia 2h kD TS 2ST 203 
e NSO" 1320" 34 29 1.30 

20 |B Zoo 20 Oc eau Dred, 
12 1 SEM 3S 133, 4 | Wwe! 

Table 5
Relative Mean Squared Error of the Weighting Cell Estimator
Relative to the Estimator Without Nonresponse for $Y_2$

Sample   Response          Number of Cells           Naive
size     mechanism      2      3      5      8     estimator
Ch 133 Silke eek 1O027ae07 2.07 
C2 PO9? «1:05 ° "2.02002 1.26 
200 
EL Ce SO sy lll Ps ent we Gini 4 oe 
[SY L235 90 Od op AOSR LOL Jd 
Cl LSSeeielo SiO 2 b.08 Daan 
©2 OOS 05 LOS. 103 1.30 
500 
al G60) 2330) 27025 7 ts | 69:75 
12 SUS FI4. 1.04, .-1.02 7.83 


The difference between the results for both variables is
surprising at first, but it can be explained using the results
from section 3. Clearly, the results for $Y_2$ follow the
asymptotic theory, in that the MSE improves as the number
of cells increases (as long as sufficient observations are
available in each cell). In the case of $Y_1$, note first that the
bias is negligible relative to the standard deviation for all
values of G (see Table 2), so that the change in MSE is due
almost exclusively to differences in variance. It turns out
that when a variable is iid in the population and sampling is
equal-probability, the asymptotic variance in Theorem 3.1
is relatively insensitive to the number of cells. In that case,
the increase in MSE is influenced by the variability implied
in the linear approximation in Theorem 3.1, which increases
with the number of cells.

The theory described in this article applies to response
functions that can have arbitrary smooth shape. In order to
evaluate results for more complicated functions, we also
created a variable $Y_3 = 25 + 95X - 95X^2 + \epsilon$, where
$\epsilon \sim N(0, 3)$, so that $Y_3$ has mean 40.9 and variance
51.8, and two additional quadratic response mechanisms

- $\phi_{Q1}(X) = 0.17 + 1.96X - 1.96X^2$
- $\phi_{Q2}(X) = 0.50 + 1.80X - 1.80X^2$.

The results (not shown) broadly reflect the findings for the
previous variables. When the response mechanism and the
variables are correlated (the linear variable is correlated
with the linear response mechanism, and the quadratic
variable is correlated with the linear and quadratic response
mechanisms), significant bias occurs but can be removed by
increasing the number of cells. In the case of the quadratic
response mechanism and the quadratic variable, eight or
more cells appear to be required to remove the bias.
Similarly, the relative efficiency improves for all response
mechanisms for both the linear ($Y_2$) and quadratic variable,
with the most dramatic results found for the linear
variable/linear response and quadratic variable/quadratic
response cases.

In the previous sections of this article, we approximated
the weighting cell estimator by a “linearized” estimator
$\tilde{t}_{wc}$ and then derived the asymptotic properties of that
estimator. It is therefore of interest to compare the statistical
properties of both estimators in simulated settings. For all
the scenarios in Table 1, we calculated the relative efficiencies
of the weighting cell estimator compared to the
linearized estimator. These relative efficiencies were all
close to 1.00, with the largest deviation being a value of
1.08. Hence, the statistical properties of the weighting cell
estimator appear to be well approximated by those of the
linearized estimator.


5. CONCLUSIONS 


We have shown that the weighting cell estimator, 
corresponding also to the FEFI estimator proposed by Kim 
and Fuller (1999), is consistent with respect to the sampling 
design and a nonparametric response model. That model 
does not require the correct specification of homogeneous 
response probability cells, as long as a variable related to 
the response probability can be identified. 

The statistical properties of the estimator depend on the 
number of cells used in the estimation, but the relationship 
is rather complex. Asymptotically, there appears to be a 
trade-off between the goodness of the approximation of the 
weighting cell estimator by a linearized estimator, which 
requires a small number of cells, and the mean squared 
error of that linearized estimator, which is reduced when a 
large number of cells are used. While useful in understanding
the asymptotic behavior of the estimator, these
findings only provide limited guidance for choosing the
number of cells for a particular survey. However, these 
findings show that reliable inference for weighting cell 
estimators will require cells with reasonable sample sizes, 
because variance estimates typically rely on the variance of 
the linearized estimator as an approximation of the variance 
of the weighting cell estimator. 

The simulation experiments show that when the variable
of interest and the response mechanism are uncorrelated,
the number of cells has virtually no effect on the design bias
of the estimator. In that case, even the estimator
with a single weighting cell (corresponding to a simple ratio
adjustment) is essentially unbiased, while models with
multiple cells perform equally well. When the response
mechanism and the variable of interest are related, however,
the bias properties of the weighting cell estimator depend
critically on the number of cells. In particular, estimators
with a single cell are severely biased, but even a relatively
small number of cells is sufficient to reduce both the bias
and variance of the estimator. This result holds for both
linear and nonlinear relationships between the response
mechanism and the variable of interest.

The design efficiency of estimators depends on the 
relationship between the variable of interest and the 
variable(s) used to form weighting cells. When those two 
variables are uncorrelated, the number of cells has no effect 
on the efficiency of the estimator. Conversely, when those 
two variables are correlated, increasing the number of cells 
improves the design efficiency of the estimator. Even a 
small number of cells dramatically improves the 
performance of the estimator. 

Overall, it appears that in the presence of nonresponse, 
forming at least a small number of weighting cells based on 
a variable related to the non-response provides a good 
“insurance policy” against design bias and design ineffi- 
ciency. This article has shown that this adjustment does not 
require the assumption that the cells be based on a priori 
knowledge of constant nonresponse groups. The resulting 
weighting cell estimator will never perform worse than the 
naive estimator with a single ratio adjustment for the whole 
sample, and it might perform significantly better. 
APPENDIX


Derivations of Theoretical Results 


Lemma 1. Assume that the conditions (A1)-(A3) and
(R1)-(R2) hold. For $i_1, i_2, \ldots, i_k \in U_\nu$, define


k 
bs atta | IG, ms m0) , 


where , = @(X;). Consider the A, i, of (3). Let A’ 
denotes the r-fold Cartesian product a the set A, where r is 


a fixed positive integer, Ae 

= dis A Cae yal eile eh GR 
Apert Ss isle oop at and Ndi 
{(i,,1,,--..4,) € U,: exactly k components are distinct}, 
k=72,3,%% Lnensfor Ts; 


O(Nen,*), if k=5 


dougie O(Nen,”), if k=6 

Nong Sma GG Paty Arg eal) y= ver. 
okey Pines ahd ww, NOW, ), if KT 
O(n,*) , if k=8. 


Proof of Lemma 1. See Da Silva (2003). 


Lemma 2. Suppose the conditions of Theorem 3.1 hold.


Consider the vectors t= (t, x ty gol neni ae a 1, 
(iene bh mced ae tate, aye with 
a rest IN, Gin: Lett, , -E@,). Then for all 


aE R Sine 
= 8 
~ Elf, -t,,1E 
& 


Proof of Lemma 2: See Da Silva (2003). 


Proof of Theorem 3.1: Consider the proof of (5). Let 
GANG...) ve R> and hA:R°-R, where A(a) = 
a, a, ide a,# 0. Define 


lé,, tl) = O((G Jn, 4), 


3 
- -1 -1 
n,,(@) = A(N, 't,,) = PN Taran S), 


where h©@: BAC Vouk Eee = h(a) ~ n,,(a). Note 


that 4, oi NAN, - oh and hence, detining the 
“linearized” ae =>)" iN ri A. by »)» we can 
write 

= Ste mercer 

N WC WC v v? 
where
G, 
RSL PERM NE 
ak sewers ent & ay 
and 
inc 
x -1A* ai 
Mg wee ” hn, (Me bgy) — Ney, é..)) 


Consider first the term ,. Observe that 
=i =1A 
[Te Ve Eo) — Ney WV, el 


ieee 
merits, «fa | 
N, 


ROW, 1 t.,) | 
By (A4) and (A5), it is straightforward to check that h(-) 
and h“(-), k = 1,2,3, are O(1) when evaluated at Nagle. 
for all g ae 2,...,.G,. Since by construction, we have 
1/N, | ae ie: in... we conclude that |n,| = 
O(G,/n,). ein to SOMES the proof of (5), it remains to 
show that é, AON (ER ae Let f,,(@) = (e,,(a))°. By the C, 
the inequality (Sen Aa Singer 1993, page eli), 


If,y(@)/? < 5° (|h(a)|*+ |AW,'t,, 14 


Yo ney, ; ks |a,-N, me 


Using (A1) and (A4), STE OIGRS bounding arguments 
show _ that |n(N, 2 4 =On,/G,)*) and _ that 
fe O(1) ia k =1,2, 3. Therefore, 
4 
St (Ox 
ERS O 
&Vv 4 
G 


Vv 


aes 


Since by ~ ule 2, Ne “E\t,, -t, sf" = O((n,/G,) i): and 
Vie ite ,) |? is continuous at any realization of N,. ti ae 
then the sequence eG ios) |7} satisfies the eeadiions 
of Theorem 5.4.4 (with n =1, p =4) of Fuller (1996, page 


247). Therefore, 
-] Ax 2 
E Foye toy) 


Now, from the Oar U IS of f,,,(-) and its derivatives up to 
ordenathree, a J, Ne ‘f ny otic the conditions of 
Theorem 5.4.3 Gath ays = a s=4 and a,=O(J/G, /n,) of 
Fuller (1996, pages 244-245). Hence, 


=O (1) evs gal Oe. 5 Gt 


2 
-] ~* 4 C 
Ef,,(N,'é,,) = O(a}) = 0] = 


n,, 


pf WEG 2, ao h Goes 


because iB. iC ) and all of its derivatives up to order three are 


zero at Ny ats ,- Therefore, we conclude that 


zi 1 ae 
Elé,| < —)N,Ele,,@, £,,)| 
N,v 
1 Be ti G, 
Soe SN BNE ay) cuca 
Ne U, Us 


which leads to $e_\nu = O_p(G_\nu / n_\nu)$ by an application of
Markov’s inequality.
Expressions (6) and (7) are obtained by direct computation
of the moments of the linear estimator $\tilde{t}_{wc}$ under the
sampling design and the response mechanism.
Proof of Corollary 3.1: Let 
t. n 
dione! wc B 
Vv yie N, v v 
and 
eee we twe 
Peleg 7 U2 Ne iit 
where $V_\nu = \mathrm{Var}(\tilde{t}_{wc}/N_\nu)$. Hence,
; 1/2 ie 
Wael VA 
N, N, 
Since $V / (n_\nu V_\nu) \to 1$ as $\nu \to \infty$, then,
1/2 t 
eS -ail V AB CS ae nal 
By yi n, V,, INe vi iy yi2 


where $Z \sim N(0, V)$. Also, (A7) with $\gamma < 1/2$ implies that
$n_\nu^{1/2} O_p(G_\nu n_\nu^{-1}) = o_p(1)$. Hence, by Theorem 3.1,

five Syhwa =o (1). 
N, N, 4 


1 


yi2 


V 
n, Vy 


1/2 
1/2 
n 


Vv v 


The result of the corollary follows, therefore, from Fuller 
(1996, Theorem 5.2.1). 


Proof of Theorem 3.2: Fix a $g \in \{1, 2, \ldots, G_\nu\}$. The
conditions of the theorem imply, by the Intermediate Value
Theorem, that there exists $X_{0g}$ inside the interval defined by
the lowest and the highest values of $X_i$, $i \in U_g$, such that
$\bar{\phi}_g = \phi(X_{0g})$. Also, by the Mean Value
Theorem, for all $i \in U_g$,

$$\phi(X_i) = \phi(X_{0g}) + \phi'(c^*)(X_i - X_{0g}),$$

where $c^*$ is between $X_i$ and $X_{0g}$. So,


iy X vuiy EX 
Jo; ~ ,|=| @' "1X; ~ Xog| $ CEO, (8) 
for some constant $C \in (0, \infty)$ and, by (A5) and (A6),


ie =f Xww)y ~ Xi) 
Bias} —||/<CA, Xs : 
Observe now that since 
Pe ie, 
i; g i 9, 
then, by (A1, (A6) and (8), 
Y,,=O| —| +0] —~], VU,,Vg = 1,2,...G,, 
n n 


N2 2 
y= 0 wl »VU,Vg = 1,2,...,G,. 
n, 3G 


Using the facts that, by (A7), N, IN SOC G Gye Dy (AZ) 

and AS)" >) Dy APO (nblG) Mand aor” Bee, 
8g 8 

Uap oe ae On, /G; ), then, the first term of 

Var(ty./N,) is bounded by 

1 


Since the second term of $\mathrm{Var}(\tilde{t}_{wc}/N_\nu)$ is bounded by
$O(1/n_\nu)$, the conclusion follows.
Variance Estimation with Hot Deck Imputation Using a Model


J. MICHAEL BRICK, GRAHAM KALTON and JAE KWANG KIM


ABSTRACT 


When imputation is used to assign values for missing items in sample surveys, naive methods of estimating the variances
of survey estimates that treat the imputed values as if they were observed give biased variance estimates. This article
addresses the problem of variance estimation for a linear estimator in which missing values are assigned by a single hot deck
imputation (a form of imputation that is widely used in practice). We propose estimators of the variance of a linear hot deck
imputed estimator using a decomposition of the total variance suggested by Särndal (1992). A conditional approach to
variance estimation is developed that is applicable to both weighted and unweighted hot deck imputation. Estimation of the
variance of a domain estimator is also examined.


KEY WORDS: Missing data; Model-assisted approach; Conditional variance estimation. 


1. INTRODUCTION 


The important practical problem of estimating the variance
of an estimate computed from a data set in which some
of the items are missing and values are assigned by imputation
has been addressed in a number of different ways
(e.g., see Rubin 1987 and Rao and Shao 1992). The approach
used in this article is based on the model-assisted
approach introduced by Särndal (1992). In the initial
application, Särndal used the model-assisted approach with
a simple random sample in which the missing data were
imputed using deterministic ratio imputation. Subsequently,
the approach has been extended to other imputation methods
and sample designs (e.g., Deville and Särndal 1994;
Rancourt, Särndal and Lee 1994; and Gagnon, Lee,
Rancourt and Särndal 1996). This article extends the
model-assisted approach to general forms of linear estimators
in which missing values have been assigned by hot
deck imputation within imputation cells. This form of hot
deck imputation, which replaces a missing item by the value
observed for a responding unit in the same cell, is one of the
most frequently used methods of imputing for missing items
in household sample surveys (Brick and Kalton 1996). This
paper employs a conditional approach to develop a variance
estimator for hot deck imputed estimators that is valid for
general sample designs and a variety of estimation
strategies.

In the model-assisted approach, the difference between
an imputed estimator (the term used here to denote an
estimator based in part on imputed values), $\hat{\theta}_I$, and the
corresponding finite population parameter, $\theta_N$, is written as

$$\hat{\theta}_I - \theta_N = (\hat{\theta}_n - \theta_N) + (\hat{\theta}_I - \hat{\theta}_n), \qquad (1)$$

where $\hat{\theta}_n$ is the usual, approximately design unbiased,
estimator of $\theta_N$ with complete response. The first term on
the right hand side of (1) is called the sampling error and
depends only on the sampling distribution of the estimator
based on the sample design used to select the full sample,
denoted by p. The second term is the imputation error; it
depends on the sampling distribution, the response mechanism
(R) that generates the respondents from the full
sample, and the imputation mechanism (I) for filling in the
missing values. This paper is restricted to estimators $\hat{\theta}_I$ that
involve only one variable subject to missing data.

We use a model-assisted approach that makes assumptions
about the distribution of the variable of interest in the
population. We refer to these assumptions as a superpopulation
model, denoted by $\xi$. In general, the aim of
imputation is to create a multi-purpose data set that can be
validly analyzed in many different ways, potentially involving
the associations of a variable subject to imputation
with any of the other variables in the data set. Since a
superpopulation model is needed to impute for item nonresponses
in a way that preserves such associations, it is
natural to use that approach also in variance estimation.

Under the superpopulation model, the total variance for
an imputed estimator is given by

$$V_{TOT} = E_\xi E_p E_R E_I (\hat{\theta}_I - \theta_N)^2, \qquad (2)$$

where $E_\xi$, $E_p$, $E_R$ and $E_I$ refer to expectations with
respect to the superpopulation model, the sampling mechanism,
the response mechanism, and the imputation
mechanism, respectively. We assume that the sample design,
response mechanism, and the imputation mechanism
are unconfounded as described by Rubin (1987) and used
by Särndal (1992) and all of the other literature cited above
on the model-assisted approach. Essentially, unconfounded
mechanisms allow the order of the expectations to be
changed so that the expectation with respect to the model
can be taken first. Thus, the total variance can be re-written
as $V_{TOT} = E_p E_R E_I E_\xi (\hat{\theta}_I - \theta_N)^2$. Roughly speaking, unconfounded
sampling, response, and imputation mechanisms
imply that the mechanisms are independent of the distribution
of the y-value being analyzed after conditioning on
auxiliary variables (e.g., stratification variables for sampling
or imputation cells for imputing). Thus, for example,
we assume the value of the variable being imputed is independent
of the probability of response within each hot-deck
cell. Rubin (1987, pages 36-39) has a more detailed discussion
of unconfounded mechanisms.

Using the decomposition given in equation (1), Särndal
(1992) expressed the total variance for the imputed
estimator as

$$V_{TOT} = E_\xi E_p E_R E_I (\hat{\theta}_I - \theta_N)^2 = V_{SAM} + V_{IMP} + 2V_{MIX}, \qquad (3)$$

where $V_{SAM} = E_\xi E_p (\hat{\theta}_n - \theta_N)^2$ is the sampling variance,
$V_{IMP} = E_\xi E_p E_R E_I (\hat{\theta}_I - \hat{\theta}_n)^2$ is the imputation variance, and
$V_{MIX} = E_\xi E_p E_R E_I [(\hat{\theta}_I - \hat{\theta}_n)(\hat{\theta}_n - \theta_N)]$ is a mixed component.
The total variance and its components
are more aptly described as anticipated variances
because they incorporate the added expectation with respect
to the superpopulation model.
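
The three components follow from squaring the decomposition in (1). As a short worked step, written with $E = E_\xi E_p E_R E_I$ as shorthand (a notation introduced here only for this illustration),

$$(\hat{\theta}_I - \theta_N)^2 = (\hat{\theta}_n - \theta_N)^2 + (\hat{\theta}_I - \hat{\theta}_n)^2 + 2(\hat{\theta}_I - \hat{\theta}_n)(\hat{\theta}_n - \theta_N),$$

so that

$$E(\hat{\theta}_I - \theta_N)^2 = E(\hat{\theta}_n - \theta_N)^2 + E(\hat{\theta}_I - \hat{\theta}_n)^2 + 2E[(\hat{\theta}_I - \hat{\theta}_n)(\hat{\theta}_n - \theta_N)].$$

Because the sampling error $\hat{\theta}_n - \theta_N$ does not involve the response or imputation mechanisms, the first term reduces to $E_\xi E_p (\hat{\theta}_n - \theta_N)^2$, which is the sampling variance in (3).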

The model-assisted approach to variance estimation with
imputed data used in this paper should be distinguished
from model-assisted sampling (Särndal, Swensson and
Wretman 1992). With model-assisted sampling, models are
used to guide the choice of efficient sample designs and
estimators, but the validity of statistical inferences is not
dependent on the validity of the models. In contrast, when
some data are missing, reliance on models for inferences is
essential, both for point estimators and for variance
estimators for them. In this paper, the general approach to
inference employs the imputation model assumptions (i.e.,
superpopulation model and unconfoundedness assumptions)
only to the extent necessary to account for imputed
data. Both the point estimators and the variance estimators
are the standard design-based estimators when no data are
missing. Whether the variance estimators are approximately
unbiased for $V_{SAM}$ depends on the validity of the imputation
model. Also, the estimators for $V_{IMP}$ and $V_{MIX}$ rely
completely on the imputation model. Thus the validity of
the model is much more critical with model-assisted variance
estimation with imputed data than it is with model-assisted
sampling. Särndal (1992) argues that if we are
willing to accept the validity of the model in point estimation
with imputed data, we should also be willing to
accept its validity for variance estimation.


Variance estimators are obtained by conditioning on the
realized set of sampled units, responding units, and imputations.
We develop estimators of
$V_{SAM} = E_\xi[(\hat{\theta}_n - \theta_N)^2 \mid A, A_R, d]$,
$V_{IMP} = E_\xi[(\hat{\theta}_I - \hat{\theta}_n)^2 \mid A, A_R, d]$, and
$V_{MIX} = E_\xi[(\hat{\theta}_I - \hat{\theta}_n)(\hat{\theta}_n - \theta_N) \mid A, A_R, d]$, where $A$ and
$A_R$ denote sets of indices for the sampled and responding
units, respectively, and d is the set of indices for
the imputations. The conditioning is on the set of indices,
not on the values of the units. The matrix d is an
$r \times (n - r)$ matrix in which the rows refer to respondents
and the columns to nonrespondents. In this paper, we
consider only single imputation methods, in which case all
but one of the $d_{ij} = 0$ in every column. The exception
occurs in the row of the donor respondent, where $d_{ij} = 1$.
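
To make the structure of d concrete, the following sketch builds the matrix for a small hypothetical example with r = 4 respondents and n - r = 3 nonrespondents. The function name `donor_matrix` and the donor assignment are illustrative assumptions, not part of the paper.

```python
import numpy as np

def donor_matrix(donor_of, r):
    """Build the r x (n - r) matrix d from donor_of[j], the respondent row used for nonrespondent j."""
    donor_of = np.asarray(donor_of)
    m = len(donor_of)
    d = np.zeros((r, m), dtype=int)
    d[donor_of, np.arange(m)] = 1  # one donor per column (single imputation)
    return d

# Hypothetical assignment: respondent 0 donates once, respondent 2 donates twice.
d = donor_matrix([0, 2, 2], r=4)
donor_counts = d.sum(axis=1)  # number of times each respondent is used as a donor
print(d)
print(donor_counts)
```

Each column sums to one, and the row sums give the number of times each respondent serves as a donor, which is the quantity the conditional approach holds fixed.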

By considering the conditional expectations of $V_{IMP}$ and
$V_{MIX}$, the estimators reflect the number of times responding
units are used as donors in the given application rather than
taking the expectation over all possible imputation outcomes.
We argue below that these are the appropriate variances
to estimate in a given application. If the variance estimators
are conditionally unbiased, they are also, of course,
unconditionally unbiased.

A conditional approach is useful for two reasons. First,
when an estimator is conditionally unbiased and consistent
(as $\hat{\theta}_I$ is assumed to be for $\theta_N$), the conditional variance is
generally a more appropriate estimator for making inferences
from a realized sample than an unconditional variance
(Holt and Smith 1979, Rao 1999, Kalton 2002). Thus,
a variance estimator that conditions on the actual number of
times each donor is used is to be preferred to a variance
estimator that averages over all possible donor selections.
Second, the results apply to any unconfounded sampling,
response, and imputation mechanisms that produce the
same set of sampled units, respondents, and imputations.
Therefore, the results given below for hot deck imputation
apply to any unconfounded imputation scheme that substitutes
observed values for missing ones and for which
$E_\xi(\hat{\theta}_I) = E_\xi(\hat{\theta}_n)$.


2. HOT DECK IMPUTATION 


We consider a simple model for which hot deck 
imputation is appropriate. Assume that the finite population 
(U) is composed of G classes or cells. Within cell g(g = 
1,...,G), the elements in U are realizations of indepen- 
dently and identically distributed random variables with 
mean $\mu_g$ and variance $\sigma_g^2$. This cell mean model can be
written as 


$$y_i \overset{iid}{\sim} (\mu_g, \sigma_g^2), \quad i \in U_g, \qquad (4)$$

where $\overset{iid}{\sim}$ is an abbreviation for independently and identically
distributed.

A linear estimator of $\theta_N$ with complete item response
from a complex sample survey can be written as

$$\hat\theta_n = \sum_{i\in A} w_i y_i, \qquad (5)$$

where $w_i$ is the weight that accounts for unequal selection
probabilities and the estimation strategy. When the cell
mean model holds, a more efficient estimator of $\theta_N$ uses the
unweighted group means, i.e., $\sum_g \sum_{i\in A_g} w_i \bar y_g$, where
$\bar y_g = \sum_{i\in A_g} y_i / n_g$. However, the model-assisted approach
does not place complete reliance on the model; rather, it 
uses the standard design-based approach to the extent 
possible and the model is used only for the missing data. 
The weights in (5) can be the inverse of the probability of 
selection weights or calibration adjusted weights, as 
described below. 


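The hot deck imputed value for $y_j$ is $y_j^* = \sum_{i\in A_R} d_{ij} y_i$
and the imputed estimator is

$$\hat\theta_I = \sum_{i\in A} w_i \tilde y_i = \sum_{i\in A_R} w_i y_i + \sum_{j\in A_M} w_j \sum_{i\in A_R} d_{ij} y_i, \qquad (6)$$

where $\tilde y_i = y_i$ for $i \in A_R$ and $\tilde y_i = y_i^*$ for $i \in A_M$. We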
assume throughout that imputed values are selected from 
respondents in the same imputation cells, and that each cell 
contains at least one respondent. 

This imputation formulation does not specify the way in 
which donors are selected. It thus covers both unweighted 
hot deck imputation in which donors are selected with equal 
probabilities within each cell and weighted hot deck 
imputation. Weighted hot decks are typically used when 
assumptions are made only about the response distribution. 
The form (6) also covers with and without replacement 
imputation methods. For example, it covers the common hot 
deck procedure in which a respondent is randomly selected 
to be a donor within a cell, and then that respondent is not 
used as a donor again until every other respondent in the 
cell has been used. 
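
The donor selection described above can be sketched in code. The following is a minimal illustration, assuming an unweighted with-replacement hot deck; the function names `hot_deck` and `impute` and the dictionary representation of the donor matrix d are hypothetical conveniences, not part of the paper.

```python
import random


def hot_deck(cell, responded, rng):
    """Pick one donor per nonrespondent, within the same imputation cell.

    cell[i] is the imputation cell of unit i and responded[i] is True for
    respondents. Donors are drawn with equal probability and with
    replacement (an assumption of this sketch). The returned dict maps each
    nonrespondent j to its donor i, i.e. it records where d_ij = 1.
    """
    respondents_by_cell = {}
    for i, (g, r) in enumerate(zip(cell, responded)):
        if r:
            respondents_by_cell.setdefault(g, []).append(i)
    donor = {}
    for j, (g, r) in enumerate(zip(cell, responded)):
        if not r:
            if g not in respondents_by_cell:
                raise ValueError("each cell needs at least one respondent")
            donor[j] = rng.choice(respondents_by_cell[g])
    return donor


def impute(y, donor):
    """Return the completed data: observed values plus donated values."""
    return [y[donor[i]] if i in donor else y[i] for i in range(len(y))]
```

Recording the donor of every imputed value in this way is what later makes it possible to evaluate $d_{ij}$ and the number of times each donor is used.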

While not explicitly considered here, nearest neighbor 
imputation procedures that use continuous variables to 
identify a small set of the most similar respondents and then 
randomly select one as the donor, satisfy the above require- 
ments. Furthermore, researchers often use hot deck methods 
even when continuous variables are available. Little (1986) 
discusses strategies for forming imputation cells using 
variables that are predictive of the y-variable and notes that 
imputation within cells and regression imputation should 
produce similar results in many circumstances. Cochran 
(1968) and Aigner, Goldberger and Kalton (1975) show 
that a relatively small number of well-constructed cells 
formed from a continuous variable can capture a large
proportion of the predictive power of the variable. 

The conditional bias of the imputed estimator under the 
cell mean model is 


$$E_\xi(\hat\theta_I - \hat\theta_n \mid A, A_R, d) = E_\xi\Big[\sum_{j\in A_M} w_j (y_j^* - y_j) \,\Big|\, A, A_R, d\Big] = 0,$$

since $E_\xi(y_j^*) = E_\xi(\sum_{i\in A_R} d_{ij} y_i) = \sum_{i\in A_R} d_{ij} E_\xi(y_i) =
\sum_{i\in A_R} d_{ij}\mu_g = \mu_g$ for j in cell g. This expectation is
conditioned on the indices of the sampled units, the
responding units, and the donors. However, since the
estimator is conditionally unbiased for any sample, it is also
unconditionally unbiased. Kim and Fuller (1999) also use
this conditioning argument. Estimators for each component
of the variance of the hot deck imputed estimator are given
in the next section.


3. ESTIMATION OF THE COMPONENTS OF 
THE TOTAL VARIANCE 


This section contains the main results about estimators 
of the three components of the total variance of a linear hot 
deck imputed estimator. Throughout, we assume uncon- 
founded sampling, response, and imputation mechanisms 
and a linear complete sample estimator of the form (5). The 
results require that the cell mean model holds and that there 
is at least one respondent in each imputation cell. We begin 
with the variance due to sampling, $V_{SAM}$.

We assume that there exists a complete sample variance
estimator, $\hat V_n$, that is design unbiased for the sampling
variance of $\hat\theta_n$, is a quadratic in the y-variable, and is of the
form

$$\hat V_n = \sum_{i\in A}\sum_{j\in A} Q_{ij} y_i y_j = \sum_{i\in A} Q_{ii} y_i^2 + 2\sum_{i<j;\; i,j\in A} Q_{ij} y_i y_j \qquad (7)$$

for known coefficients $Q_{ij}$. This formulation covers the
Horvitz-Thompson estimator, where the $Q_{ij}$ are determined
by the single and joint probabilities of selection. It also
covers the linearized variance estimator for the generalized
regression (GREG) estimator. Rao, Yung and Hidiroglou
(2002) show that the linearized variance estimator for the
GREG estimator can be written by substituting $g_i e_i$ for $y_i$
in the variance estimator for the Horvitz-Thompson estimator
of a total. Here, $g_i$ is the sample-dependent g-weight
and $e_i = y_i - x_i'\hat B$, where $x_i$ is the vector of auxiliary
variables and $\hat B$ is the vector of estimated regression
coefficients. Since $g_i$ is not a function of y and $\hat B$ is linear
in the y-variable, $g_i e_i$ is linear in y. Therefore, the
linearized variance estimator for the GREG estimator is
quadratic in y and can be expressed in the form given by
equation (7). Note that in this case the $Q_{ij}$ may be dependent
on the specific sample as well as on the selection
probabilities. Deville and Särndal (1992) show that any
calibration estimator has the same asymptotic variance as the
GREG. Thus, asymptotic variance estimators for calibration
estimators in general have the required quadratic form.

The naive variance estimator treats imputed values as if
they were observed values and can be written as

$$\hat V_I = \sum_{i\in A}\sum_{j\in A} Q_{ij}\tilde y_i \tilde y_j. \qquad (8)$$

Lemma 1 gives the bias of the naive variance estimator as
an estimator for $V_{SAM}$. As noted earlier, the naive variance
estimator is proposed as the estimator of $V_{SAM}$ to be as
consistent as possible with design-based inference. An 
additional practical reason for using the naive variance 
estimator is to take advantage of existing software programs 
that estimate the sampling variance under complex samples. 


Lemma 1. Under the cell mean model with unconfounded
sampling, response, and imputation mechanisms and the
assumptions that $\hat\theta_I$ is an unbiased hot deck imputed linear
estimator given by (6) and $\hat V_n$ is an unbiased complete
sample variance estimator given by (7), the bias of the
naive variance estimator, $\hat V_I$, as an estimator of $V_{SAM}$ is

$$2\sum_{g=1}^{G}\sum_{i\in A_{Rg}}\sum_{j\in A_{Mg}} Q_{ij} d_{ij}\sigma_g^2 + 2\sum_{g=1}^{G}\sum_{i<j;\; i,j\in A_{Mg}} Q_{ij}\gamma_{ij}\sigma_g^2, \qquad (9)$$

where $A_{Rg} = A_R \cap U_g$, $A_{Mg} = A_M \cap U_g$, and

$$\gamma_{ij} = \sum_{k\in A_R} d_{ki} d_{kj}. \qquad (10)$$

For any two nonrespondents, i and j, that have the same
donor, $\gamma_{ij} = 1$; $\gamma_{ij} = 0$ otherwise. By definition, $\gamma_{ii} = 1$.
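Proof. We begin by noting that the difference between $\hat V_I$
and $\hat V_n$ can be written as:

$$\begin{aligned}
\hat V_I - \hat V_n &= \sum_{i\in A} Q_{ii}(\tilde y_i^2 - y_i^2) + 2\sum_{i<j;\; i,j\in A} Q_{ij}(\tilde y_i\tilde y_j - y_i y_j) \\
&= \sum_{i\in A_M} Q_{ii}(y_i^{*2} - y_i^2) + 2\sum_{i\in A_R,\, j\in A_M} Q_{ij}(y_i y_j^* - y_i y_j) \\
&\quad + 2\sum_{i<j;\; i,j\in A_M} Q_{ij}(y_i^* y_j^* - y_i y_j). \qquad (11)
\end{aligned}$$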


Under the cell mean model, the conditional expectation of
the first term of (11) is zero. The conditional expectation
$E_\xi(y_i y_j^* - y_i y_j) = E_\xi[y_i(y_j^* - y_j)] = 0$ unless respondent i
is the donor for nonrespondent j; it is thus zero when units
i and j are in different cells and is only nonzero for one i
and j in the same cell g. It may be represented by
$E_\xi[y_i(y_j^* - y_j)] = d_{ij}\sigma_g^2$. The conditional expectation
$E_\xi(y_i^* y_j^* - y_i y_j)$ is zero unless nonresponding units i and
j have the same donor, which can occur only if these units
are in the same cell. It can be represented by
$E_\xi(y_i^* y_j^* - y_i y_j) = \gamma_{ij}\sigma_g^2$ for $i \neq j$. Applying these results in
equation (11) gives


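$$E_\xi(\hat V_I - \hat V_n \mid A, A_R, d) = 2\sum_{g=1}^{G}\sum_{i\in A_{Rg}}\sum_{j\in A_{Mg}} Q_{ij} d_{ij}\sigma_g^2 + 2\sum_{g=1}^{G}\sum_{i<j;\; i,j\in A_{Mg}} Q_{ij}\gamma_{ij}\sigma_g^2. \qquad (12)$$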

The proof is completed by noting that since $\hat V_n$ is
unbiased under the design, it is also unbiased for $V_{SAM}$.
Substituting a model unbiased estimator for $\sigma_g^2$, say $\hat\sigma_g^2$,
gives an unbiased estimator of the bias of the naive variance
estimator. Note that whenever respondents donate their 
values to more than one nonrespondent, the last term in 
equation (12) is positive; otherwise, it is zero. 

Two simple examples illustrate applications of these 
results. Consider first the estimation of a population mean 
from a simple random sample selected with replacement. In 
this case, $Q_{ii} = n^{-2}$ and $Q_{ij} = -n^{-2}(n-1)^{-1}$ for $i \neq j$.
Assume that the cell mean model holds with hot deck
imputation and that no donor is used more than once. By
Lemma 1, the bias of $\hat V_I$ is $-2n^{-2}(n-1)^{-1}\sum_g m_g\sigma_g^2$,
where $m_g = \sum_{i\in A_{Rg}}\sum_{j\in A_{Mg}} d_{ij}$ is the number of imputed
values in cell g. In this case, the bias of the naive variance
estimator is $O_p(n^{-2})$ and hence is negligible for large n.
Now suppose that every missing value in each cell is
imputed from the same donor. In this case, with
$\sum_{i<j;\; i,j\in A_{Mg}}\gamma_{ij} = m_g(m_g-1)/2$, the bias of $\hat V_I$ is
$-n^{-2}(n-1)^{-1}\sum_g (m_g^2 + m_g)\sigma_g^2$, which is $O_p(n^{-1})$ and is
of the same order as the variance estimator.
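
The two simple random sampling cases above can be checked numerically. The sketch below evaluates the bias expression of Lemma 1 directly; the function names and the representation of the donor assignments as a dictionary (nonrespondent index mapped to donor index) are assumptions made for illustration only.

```python
def srs_coefficients(n):
    """Q_ij for the variance estimator of a mean under SRS with replacement."""
    def Q(i, j):
        return 1.0 / n**2 if i == j else -1.0 / (n**2 * (n - 1))
    return Q


def naive_variance_bias(Q, donor, cell, sigma2):
    """Conditional bias of the naive variance estimator, as in equation (9).

    donor maps each nonrespondent j to its donor i (so d_ij = 1),
    cell[i] is the imputation cell of unit i, and sigma2[g] is the
    variance in cell g.
    """
    missing = sorted(donor)
    bias = 0.0
    # First term: donor / recipient pairs, where d_ij = 1.
    for j in missing:
        bias += 2.0 * Q(donor[j], j) * sigma2[cell[j]]
    # Second term: pairs of nonrespondents sharing a donor (gamma_ij = 1).
    for a, i in enumerate(missing):
        for j in missing[a + 1:]:
            if donor[i] == donor[j]:
                bias += 2.0 * Q(i, j) * sigma2[cell[i]]
    return bias
```

With n = 10, a single cell with variance 2, and four nonrespondents, the function returns $-2n^{-2}(n-1)^{-1} m\sigma^2$ when every donor is distinct and $-n^{-2}(n-1)^{-1}(m^2+m)\sigma^2$ when one respondent donates to all four.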

As the second example, consider a simple two-stage 
sample of size n = ab, in which a clusters are selected from 
a population of A equal-sized clusters by simple random 
sampling and b of B elements are selected by simple 
random sampling within each sampled cluster. Let $y_{\alpha i}$ be
the value for y for sampled unit i in cluster $\alpha$. Assume that
the first stage sampling fraction is small enough to ignore.
The estimate of the variance of the sample mean is of the
form given by equation (7) where $Q_{\alpha i,\beta j} = n^{-2}$
for $\alpha = \beta$, and $Q_{\alpha i,\beta j} = -n^{-2}(a-1)^{-1}$ for $\alpha \neq \beta$. These
values can now be inserted into equation (9) to compute an
estimate of the bias. For example, suppose that all missing
values are imputed using donors from the same cluster (the
cells are the clusters) and no donor is used more than once.
In this case, the bias of the naive variance estimator is
$2n^{-2}\sum_\alpha m_\alpha\sigma_\alpha^2$, where $m_\alpha$ is the number of nonrespondents
in cluster $\alpha$. Now suppose an overall cell mean model hot
deck is used and no donor can donate more than once, but
that donors are always chosen from different clusters than
their missing values. In this case, the bias of the naive
variance estimator is $-2n^{-2}(a-1)^{-1}\sigma^2\sum_\alpha m_\alpha$. This
two-stage example shows the naive variance estimator can be
biased in either direction. In both of the cases considered, 
the bias is of lower order than the variance, and if a is large 
the bias will be negligible. 

The second component of the total variance is the 
variance due to imputation, $V_{IMP}$. Lemma 2 gives an
unbiased estimator for this component with hot deck 
imputation. 


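Lemma 2. Under the assumptions used in Lemma 1, an
unbiased estimator of $V_{IMP}$ is

$$\hat V_{IMP} = 2\sum_{g=1}^{G}\Big[\sum_{i\in A_{Mg}} w_i^2\hat\sigma_g^2 + \sum_{i<j;\; i,j\in A_{Mg}} w_i w_j\gamma_{ij}\hat\sigma_g^2\Big], \qquad (13)$$

where $\hat\sigma_g^2$ is an unbiased estimator for $\sigma_g^2$.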


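Proof. Since the variance due to imputation involves the
squared difference between the imputed and complete
response estimates, we begin by writing

$$\begin{aligned}
(\hat\theta_I - \hat\theta_n)^2 &= \Big[\sum_{i\in A} w_i(\tilde y_i - y_i)\Big]^2 \\
&= \sum_{i\in A_M} w_i^2 (y_i^* - y_i)^2 + 2\sum_{i<j;\; i,j\in A_M} w_i w_j (y_i^* - y_i)(y_j^* - y_j).
\end{aligned}$$

Noting that $E_\xi(y_i^* - y_i)^2 = 2\sigma_g^2$ for i in cell g and, from
above, $E_\xi[(y_i^* - y_i)(y_j^* - y_j)] = E_\xi(y_i^* y_j^* - y_i y_j) =
\gamma_{ij}\sigma_g^2$, it follows that

$$V_{IMP} = 2\sum_{g=1}^{G}\Big[\sum_{i\in A_{Mg}} w_i^2\sigma_g^2 + \sum_{i<j;\; i,j\in A_{Mg}} w_i w_j\gamma_{ij}\sigma_g^2\Big]. \qquad (14)$$

Substituting $\hat\sigma_g^2$, a model unbiased estimator for $\sigma_g^2$,
establishes the lemma.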

Equation (14) shows that the imputation variance has 
positive contributions from each imputed value and also 
from using donors more than once. For example, suppose 
that the weights for all sampled cases are equal. The 
contribution to the imputation variance from cell g is then
proportional to the sum of the number of missing cases in 
the cell and the number of pairs of nonrespondents that 
receive values from the same donors. Limiting the number 
of times donors are re-used can reduce the imputation 
variance. 

The third term in the total variance is $V_{MIX}$, which
previous research often considered small or negligible (e.g.,
Särndal 1992; Deville and Särndal 1994). Lemma 3 gives
an unbiased estimator for $V_{MIX}$.


Lemma 3. Under the assumptions used in Lemma 1, an
unbiased estimator for $V_{MIX}$ is

$$\hat V_{MIX} = \sum_{g=1}^{G}\Big[\sum_{i\in A_{Rg}}\sum_{j\in A_{Mg}} w_i w_j d_{ij} - \sum_{j\in A_{Mg}} w_j^2\Big]\hat\sigma_g^2. \qquad (15)$$
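Proof. Begin by writing $(\hat\theta_n - \theta_N)(\hat\theta_I - \hat\theta_n) =
\hat\theta_n(\hat\theta_I - \hat\theta_n) - \theta_N(\hat\theta_I - \hat\theta_n)$. Let $\theta_N$ be the finite population
total, which can be written as $\sum_{i\in U} y_i =
\sum_{i\in U-A} y_i + \sum_{i\in A_R} y_i + \sum_{i\in A_M} y_i$. Using this expression, the second
component can be expanded as

$$\theta_N(\hat\theta_I - \hat\theta_n) = \Big(\sum_{i\in U-A} y_i + \sum_{i\in A_R} y_i + \sum_{i\in A_M} y_i\Big)\Big(\sum_{j\in A_M} w_j (y_j^* - y_j)\Big).$$

In taking the conditional expectation of this product, the
only nonzero contributions occur either when unit i in $A_R$
is the donor for unit j or when unit i in $A_M$ in the first set of
parentheses is unit j in the second set. In the first case,
$E_\xi[y_i(y_j^* - y_j)] = d_{ij}\sigma_g^2$ for $i \in A_{Rg}$, $j \in A_{Mg}$. In the second
case, if nonrespondent unit i in $A_{Mg}$ is the same as unit j in
the second term, i = j, $E_\xi[y_i(y_j^* - y_j)] = -\sigma_g^2$ and this
expectation is 0 otherwise. Thus,

$$E_\xi(\theta_N(\hat\theta_I - \hat\theta_n) \mid A, A_R, d) = \sum_{g=1}^{G}\Big[\sum_{i\in A_{Rg}}\sum_{j\in A_{Mg}} d_{ij} w_j - \sum_{j\in A_{Mg}} w_j\Big]\sigma_g^2 = 0.$$

The first term can be expressed as

$$\hat\theta_n(\hat\theta_I - \hat\theta_n) = \Big(\sum_{i\in A_R} w_i y_i + \sum_{i\in A_M} w_i y_i\Big)\Big(\sum_{j\in A_M} w_j (y_j^* - y_j)\Big).$$

Using the results for $E_\xi(y_i(y_j^* - y_j))$ given above,

$$V_{MIX} = \sum_{g=1}^{G}\Big[\sum_{i\in A_{Rg}}\sum_{j\in A_{Mg}} w_i w_j d_{ij} - \sum_{j\in A_{Mg}} w_j^2\Big]\sigma_g^2. \qquad (16)$$

Substituting an unbiased estimator of $\sigma_g^2$ proves the lemma.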


62 Brick, Kalton and Kim: Variance Estimation with Hot Deck Imputation Using a Model 


The estimator of $V_{MIX}$ is zero when the weights are
constant, or more generally when the weights of the donors
are equal to the weights of the missing cases to which they
are assigned. Most of the simulations in the literature (e.g.,
Särndal 1992; Lee, Rancourt and Särndal 1995) have used
simple random samples so that the estimates of the mixed 
term from the simulations are approximately equal to zero. 

To illustrate the effect of unequal weights, consider a 
stratified simple random sample selected from two equal 
size strata with replacement, and suppose that the sampling 
rate in stratum 2 is k times the rate in stratum 1. Let the 
imputation model be the overall cell mean model and let the 
hot deck procedure select donors with simple random 
sampling without replacement. For this simple situation, 
$V_{MIX}$ can be derived algebraically. Table 1 shows the
percentage contribution of the mixed term to the total
variance ($100 \times 2V_{MIX}/V_{TOT}$) for various combinations of
strata response rates. The table illustrates the fact that when 
the sampling weights are unequal, the contribution of the 
mixed term may be important and can be either positive or 
negative. The mixed term may also be important in domain 
estimation, as discussed in the next section. 


Table 1
Percentage Contribution of the Mixed Term to $V_{TOT}$

    Response rate           Oversampling rate in stratum 2
Stratum 1   Stratum 2       k = 2         k = 4          k = 6
  100%         80%            4.3      [illegible]    [illegible]
  100          60             8.7         10.8           18.3
  100          40            13.7         18.3        [illegible]
  100          20            19.9         28.8           29.7

   60         100           -15.4        -34.1          -44.5
   60          80           -10.4      [illegible]      -37.6
   60          60            -5.2        -19            -29.3
   60          40             1           -8.8          -18.2
   60          20             9.4          6.5            0


Now consider estimating the total variance using the 
three lemmas for the hot deck estimator under the cell mean 
model. To estimate $V_{SAM}$, we can either use the naive variance
estimator, with its bias as given in Lemma 1, or correct
for the bias with a procedure similar to that recommended
by Särndal (1992). For a single stage sample, the bias correction
given by Lemma 1 is easy to apply. However, with
multi-stage sampling the correction involving Q may be 
complicated and difficult to implement in practice. In this 
case, the naive variance estimator should produce an ade- 
quate approximation provided that the number of sampled 
clusters is large, that no donor is used too often, and that the 
percentage of missing data in each cell is not extremely 
large. 


For the other two components, the only unknown 
quantities that must be estimated from the sample are the 
cell variances, $\sigma_g^2$. These parameters could be estimated
using either unweighted observations or weighted obser- 
vations, where the weights are the selection weights. Fuller 
(2002) recommends the use of weighted observations to 
provide more robust estimates. Unbiased estimators of the 
conditional variance due to imputation and the mixed 
component are computed by substituting unbiased estimates 
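of the cell variances, $\hat\sigma_g^2$. Then, adding $\hat V_{SAM}$, $\hat V_{IMP}$, and
$2\hat V_{MIX}$ gives an estimator of the total variance:

$$\hat V_{TOT} = \hat V_{SAM} + 2\sum_{g=1}^{G}\sum_{i<j;\; i,j\in A_{Mg}} w_i w_j\gamma_{ij}\hat\sigma_g^2 + 2\sum_{g=1}^{G}\sum_{i\in A_{Rg}}\sum_{j\in A_{Mg}} w_i w_j d_{ij}\hat\sigma_g^2. \qquad (17)$$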
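
The imputation and mixed components are simple sums over the donor assignments, so they can be computed directly once the donor of each imputed value is recorded. The sketch below is one possible arrangement, assuming the estimators of Lemmas 2 and 3; the function names, the dictionary `donor` (nonrespondent mapped to donor), and the argument `v_sam_hat` (whatever estimate of the sampling variance the analyst has chosen, for example the naive one) are illustrative assumptions.

```python
def imputation_components(w, donor, cell, sigma2_hat):
    """Return (V_IMP_hat, V_MIX_hat) for a hot deck imputed linear estimator.

    w[i] is the weight of unit i, donor maps each nonrespondent j to its
    donor i, cell[i] is the imputation cell of unit i, and sigma2_hat[g]
    is the estimated variance in cell g.
    """
    missing = sorted(donor)
    v_imp = 0.0
    v_mix = 0.0
    for j in missing:
        s2 = sigma2_hat[cell[j]]
        v_imp += 2.0 * w[j] ** 2 * s2                     # each imputed value
        v_mix += (w[donor[j]] * w[j] - w[j] ** 2) * s2    # donor vs. recipient weight
    for a, i in enumerate(missing):
        for j in missing[a + 1:]:
            if donor[i] == donor[j]:                      # gamma_ij = 1
                v_imp += 2.0 * w[i] * w[j] * sigma2_hat[cell[i]]
    return v_imp, v_mix


def total_variance(v_sam_hat, w, donor, cell, sigma2_hat):
    """Combine the three components: V_SAM + V_IMP + 2 * V_MIX."""
    v_imp, v_mix = imputation_components(w, donor, cell, sigma2_hat)
    return v_sam_hat + v_imp + 2.0 * v_mix
```

With equal weights the mixed component returned by this sketch is zero, and it becomes nonzero as soon as a donor's weight differs from the weight of the case it is assigned to.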


To examine this estimator, we give a few simple 
examples with known solutions. All of these examples 
involve samples with equal weights so the mixed com- 
ponent is zero. First, assume simple random sampling with 
replacement, hot deck imputation under the overall cell 
mean model, and no donor used more than once. Using the 
naive variance estimator for $V_{SAM}$, the estimated total
variance is $n^{-1}s_I^2 + 2n^{-1}\hat\sigma^2(1 - rn^{-1})$, where $s_I^2 =
(n-1)^{-1}\sum_{i\in A}(\tilde y_i - \bar y_I)^2$, r is the number of respondents,
and m is the number of missing cases. If we use $\hat\sigma^2$ instead
of $s_I^2$ (where $\hat\sigma^2$ is model unbiased while $s_I^2$ has a small
sample bias), then this simplifies to $r^{-1}\hat\sigma^2[1 + m(r-m)n^{-2}]$.
Taking the expectation of this estimator gives the 
unconditional variance of the without-replacement hot deck 
estimator given by Kalton (1983, page 25, 2.3.1.7). 

If a multiple cell mean model rather than an overall cell 
mean model is used, then the estimated total variance is 
$n^{-1}s_I^2 + 2n^{-2}\sum_g\hat\sigma_g^2(n_g - r_g)$, which is similar to the
result given by Tollefson and Fuller (1992). 

Continuing with the simple random sampling example, 
now allow donors to be used more than once with the 
overall cell mean model. Again using $\hat\sigma^2$ instead of $s_I^2$, the
estimated total variance is approximately

$$n^{-2}\hat\sigma^2\Big[n + 2\Big(m + \sum_{i<j;\; i,j\in A_M}\gamma_{ij}\Big)\Big]. \qquad (18)$$

For fixed m, the variance in equation (18) is minimized
when no donor is used more often than any other donor, to
the extent possible (thereby minimizing $\sum_{i<j;\; i,j\in A_M}\gamma_{ij}$).
Therefore, an imputation scheme that uses any donor at 
most once more than any other donor minimizes the total 
variance. 
If donors are selected by simple random sampling with
replacement, then $E[\gamma_{ij}] = r^{-1}$ and the expected value of
(18) is $r^{-1}\hat\sigma^2[1 + n^{-2}m(r-1)]$. This is the expected
variance of the with-replacement hot deck estimator given 
by Kalton (1983, page 26, 2.3.1.9). 
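
The algebra linking (18) to the two closed forms quoted above can be verified with exact arithmetic. The sketch below works with the coefficient of $\hat\sigma^2$ only; the function names are hypothetical, and n = r + m is assumed throughout.

```python
from fractions import Fraction


def coefficient_eq18(n, m, shared_donor_pairs):
    """Coefficient of sigma_hat^2 in (18): n^-2 [n + 2(m + sum of gamma_ij)]."""
    return Fraction(1, n * n) * (n + 2 * (m + shared_donor_pairs))


def closed_form_without_replacement(r, m):
    """r^-1 [1 + m(r - m) n^-2], the case where no donor is used twice."""
    n = r + m
    return Fraction(1, r) * (1 + Fraction(m * (r - m), n * n))


def closed_form_with_replacement(r, m):
    """r^-1 [1 + n^-2 m(r - 1)], the expectation under with-replacement donors."""
    n = r + m
    return Fraction(1, r) * (1 + Fraction(m * (r - 1), n * n))


def expected_shared_pairs(r, m):
    """Expected number of nonrespondent pairs sharing a donor when E[gamma_ij] = 1/r."""
    return Fraction(m * (m - 1), 2 * r)
```

For example, with r = 6 respondents and m = 4 missing cases, setting the number of shared-donor pairs to zero reproduces the without-replacement expression, and inserting the expected number of shared pairs reproduces the with-replacement expression.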

These examples show that the approach produces rea- 
sonable estimates for the total variance in simple cases and 
highlights the conditional nature of the variance estimates. 
For example, (18) is conditional on the actual number of 
times donors are used rather than on the expected number 
of times they are used (the unconditional result). The ap- 
proach is flexible enough to allow a variety of imputation 
methods, including with- and without-replacement and 
weighted and unweighted versions of the hot deck. 


4. DOMAIN ESTIMATION 


This section considers the important problem of domain 
estimation under the cell mean model with hot deck 
imputed data. Previous research on this topic is limited (Lee 
et al. 1995). The standard estimator with complete response 
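for a population total for domain v is $\hat\theta_{nv} = \sum_{i\in A_v} w_i y_i$,
which may be alternatively expressed as $\hat\theta_{nv} = \sum_{i\in A} w_i' y_i$,
where $w_i' = \delta_{vi} w_i$ with $\delta_{vi} = 1$ if $i \in A_v$ and $\delta_{vi} = 0$ otherwise.
The hot deck imputed estimator is $\hat\theta_{Iv} =
\sum_{i\in A}\delta_{vi} w_i\tilde y_i = \sum_{i\in A} w_i'\tilde y_i$. Throughout we assume that $\delta_{vi}$
is known for all $i \in A$.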

The cell mean model assumes that all the elements in a 
cell have the same distribution. In general, some elements 
in a cell may be in the domain and others not. One version 
of the model assumes a separate cell mean model for the 
domain alone and then applies an appropriate imputation 
scheme. The theory given in the previous section covers this 
case, and it will, therefore, not be discussed further here. 
While it is feasible to account for key domains in the 
imputation stage, it is impossible to consider all possible 
domains analysts may wish to study. Thus, the focus in this 
section on domains that cut across imputation cells has 
important practical implications, especially for analysis of 
public use data files. 

We now discuss the estimation of the three components 
of $V_{TOT_v}$, the variance of an imputed domain total. Consider
first the estimation of $V_{SAM_v}$. In the case of complete response,
by setting $y_i = 0$ for elements outside the domain,
the estimated sampling variance can be expressed in the
form of equation (7) as $\hat V_{nv} = \sum_{i\in A_v} Q_{ii} y_i^2 + 2\sum_{i<j;\; i,j\in A_v} Q_{ij} y_i y_j$.
With domain membership known for all sample elements,
the conditional bias of the imputed variance estimator $\hat V_{Iv}$,
following the developments in section 3, is:

$$E_\xi(\hat V_{Iv} - \hat V_{nv} \mid A, A_R, d) = 2\sum_{g=1}^{G}\sigma_g^2\Big[\sum_{i\in A_{Rgv}}\sum_{j\in A_{Mgv}} Q_{ij} d_{ij} + \sum_{i<j;\; i,j\in A_{Mgv}} Q_{ij}\gamma_{ij}\Big],$$

where $A_{Rgv}$ and $A_{Mgv}$ are the respondents and nonrespondents
in cell g that belong to domain v.

As discussed in section 3, with large samples $\hat V_{Iv}$ may be
conveniently employed to estimate $V_{SAM_v}$ using standard
survey sampling variance estimation software. It is
interesting to note that the naive variance estimator would
be unbiased if all the donors were from outside the domain
(thus, $d_{ij} = 0$) and no donor was used more than once
($\gamma_{ij} = 0$).

The derivation of $\hat V_{IMP_v}$ follows directly from Lemma 2,
where the weights are treated as constants in the conditional
expectation. Replacing $w_i$ by $w_i'$ in equation (13) gives

$$\hat V_{IMP_v} = 2\sum_{g=1}^{G}\Big[\sum_{i\in A_{Mg}} (w_i')^2 + \sum_{i<j;\; i,j\in A_{Mg}} w_i' w_j'\gamma_{ij}\Big]\hat\sigma_g^2.$$

$\hat V_{IMP_v}$ does not depend on whether donors come from within
or from outside the domain.
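The derivation of $\hat V_{MIX_v}$ also follows from section 3.
Replacing $w_i$ by $w_i'$ in equation (15) gives

$$\begin{aligned}
\hat V_{MIX_v} &= \sum_{g=1}^{G}\Big[\sum_{i\in A_{Rg}}\sum_{j\in A_{Mg}} w_i' w_j' d_{ij} - \sum_{j\in A_{Mg}} (w_j')^2\Big]\hat\sigma_g^2 \\
&= \sum_{g=1}^{G}\Big[\sum_{i\in A_{Rgv}}\sum_{j\in A_{Mgv}} w_i w_j d_{ij} - \sum_{j\in A_{Mgv}} w_j^2\Big]\hat\sigma_g^2. \qquad (20)
\end{aligned}$$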


Note that the mixed component is not zero for a domain 
total, even if all the original weights are equal. With equal 
weights w (but not equal $w'$), the contribution to $\hat V_{MIX_v}$ is
zero when the donor is from inside the domain whereas it is
negative when the donor is from outside the domain. As a
result, $\hat V_{MIX_v} = -w^2\sum_g J_g\hat\sigma_g^2$, where $J_g$ is the number of
donors from outside the domain in cell g. In this case,
ignoring the mixed component with domain estimation 
results in an overestimate of the total variance. With un- 
equal weights, the bias due to ignoring the mixed com- 
ponent can be either positive or negative. 

The total variance of a (linear) imputed domain estimator 
under the cell mean model is then estimated by 
$$\hat V_{TOT_v} = \hat V_{SAM_v} + 2\sum_{g=1}^{G}\sum_{i<j;\; i,j\in A_{Mg}} w_i' w_j'\gamma_{ij}\hat\sigma_g^2 + 2\sum_{g=1}^{G}\sum_{i\in A_{Rg}}\sum_{j\in A_{Mg}} w_i' w_j' d_{ij}\hat\sigma_g^2. \qquad (21)$$


As an illustration, consider the case of equal weights 
within the domain ($w_i' = w_v$) and no donor used more than
once. In this case, the second term on the right in (21) is
zero and the third term reflects the variance increase from
imputation. If all the missing values are imputed using
donors from the domain, then the third term is $2w_v^2\sum_g m_{gv}\hat\sigma_g^2$,
where $m_{gv}$ is the number of missing items in cell g and
domain v. On the other hand, if no units are imputed from 
within the domain, then this term is zero. Thus, the total 
variance is minimized when the donors are selected from 
outside the domain rather than from within the domain. 
This result occurs because imputing from outside the 
domain in effect substitutes a new value for a missing value 
for domain estimation, thus maintaining the original domain 
sample size. On the other hand, imputing from within the 
domain does not increase domain sample size and there is 
also a penalty to the variance from reusing a domain 
respondent’s value for the nonrespondent. 
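
The effect of donor location on the domain variance can be illustrated with a small calculation. The sketch below applies the imputation and mixed component formulas with the adjusted weights $w_i' = \delta_{vi} w_i$; the function name and the toy data are assumptions for illustration, and the example presumes the cell mean model holds across the domain boundary.

```python
def domain_components(w, in_domain, donor, cell, sigma2_hat):
    """Return (V_IMP_v_hat, V_MIX_v_hat) using adjusted weights w' = delta * w.

    in_domain[i] is True when unit i belongs to the domain; donor maps each
    nonrespondent j to its donor i; cell[i] is the imputation cell of unit i.
    """
    w_adj = [wi if flag else 0.0 for wi, flag in zip(w, in_domain)]
    missing = sorted(donor)
    v_imp = 0.0
    v_mix = 0.0
    for j in missing:
        s2 = sigma2_hat[cell[j]]
        v_imp += 2.0 * w_adj[j] ** 2 * s2
        v_mix += (w_adj[donor[j]] * w_adj[j] - w_adj[j] ** 2) * s2
    for a, i in enumerate(missing):
        for j in missing[a + 1:]:
            if donor[i] == donor[j]:
                v_imp += 2.0 * w_adj[i] * w_adj[j] * sigma2_hat[cell[i]]
    return v_imp, v_mix


# Toy data: units 0-3 respond, units 4-7 do not; units 0, 1, 4, 5 are in the domain.
w = [1.0] * 8
in_domain = [True, True, False, False, True, True, False, False]
cell = [0] * 8
sigma2_hat = {0: 1.0}
donors_inside = {4: 0, 5: 1, 6: 2, 7: 3}    # domain cases get domain donors
donors_outside = {4: 2, 5: 3, 6: 0, 7: 1}   # domain cases get non-domain donors
```

In this toy case the sum of the imputation component and twice the mixed component equals $2w^2 m_v\hat\sigma^2$ when the donors come from inside the domain and zero when they come from outside, in line with the discussion above.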

If the distribution of y varies by domain (i.e., the 
imputation model is misspecified), then choosing donors 
from outside the domain results in biased estimates. Since 
all models are misspecified to some degree, it is therefore 
generally unwise to intentionally select donors from outside 
the domain in order to minimize the variance. 


5. SIMULATION STUDY 


A small simulation study was performed to examine the 
model-assisted variance estimates for estimating an overall 
total and a domain total. A sample of 40 clusters with exactly
5 units in each cluster was selected from an infinite
superpopulation, where $y_{\alpha i}$ is the study variable for unit i in
cluster $\alpha$. The y-values were generated from $y_{\alpha i} = \tau a_\alpha + e_{\alpha i}$,
where $a_\alpha$ and $e_{\alpha i}$ are independent random draws from the
standard normal distribution. Thus, the y-values have mean
zero, variance $(\tau^2 + 1)$, and correlation $\rho = \tau^2/(1+\tau^2)$ if
the units are from the same cluster and $\rho = 0$ otherwise.
Values of $\tau = 0$ and $\tau = 0.5$ were chosen, giving correlations
of 0 and 0.2, respectively. The value $\rho = 0.2$ was
chosen to illustrate the effect of a high intraclass correlation.
In addition to the y-variable, an indicator variable for
domain v was generated by independent sampling with the 
probability of being in the domain of 0.25. Respondents 
were selected from the full sample using a uniform response
probability of 0.6 and missing values were imputed using a
single-cell with-replacement hot deck. A total of 5,000 
Monte Carlo samples was selected. 
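
One Monte Carlo sample of the kind described above could be generated as follows. This is only a sketch of the data-generating step under the stated design (40 clusters of 5 units, domain probability 0.25, response probability 0.6); the function names and the dictionary layout of each unit are illustrative assumptions, and the imputation and variance estimation steps are not shown.

```python
import random


def intraclass_correlation(tau):
    """Correlation between two units of the same cluster: tau^2 / (1 + tau^2)."""
    return tau**2 / (1.0 + tau**2)


def draw_sample(tau, rng, n_clusters=40, cluster_size=5,
                p_domain=0.25, p_respond=0.6):
    """Generate y = tau * a_cluster + e with standard normal a and e."""
    units = []
    for cluster in range(n_clusters):
        a = rng.gauss(0.0, 1.0)                      # cluster effect
        for _ in range(cluster_size):
            units.append({
                "cluster": cluster,
                "y": tau * a + rng.gauss(0.0, 1.0),  # unit-level value
                "in_domain": rng.random() < p_domain,
                "responded": rng.random() < p_respond,
            })
    return units
```

Calling `draw_sample(0.5, random.Random(1))` returns 200 units whose within-cluster correlation is 0.2 by construction.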

The simulated point estimators for the overall total and 
the domain total are unbiased. The means and biases of the 
model-assisted variance estimators ($\hat V_{TOT}$) are given in
Table 2 (the tabulated values are divided by $N^2 \times 10^{-4}$).
When $\rho = 0$, the relative biases of the variance estimators
for the overall and domain totals are very small. On the
other hand, when $\rho = 0.2$, the variance estimators have
negative relative biases that are not negligible (a relative
bias of -13% for the overall total and -5% for the domain
total). To identify the source of the bias, Table 2 also gives
the means and biases of the three variance components. The
tabled values show that $\hat V_{IMP}$ and $\hat V_{MIX}$ are approximately
unbiased, and it is only $\hat V_{SAM}$ that has a non-negligible bias.

When $\rho = 0$ the cell mean model holds and $\hat V_{SAM}$ is
unbiased as expected under the theory. When $\rho = 0.2$, the
correlation of the y-values within clusters implies that the
cell mean model assumption does not hold. The imputation
procedure replaces some missing values using donors from
outside the cluster, causing $\hat V_{SAM}$ to underestimate the sampling
variance due to the underestimation of the intraclass
correlation. In this particular situation, the model failures do 
not result in biased estimates for the other two components. 
However, these components could be biased under other 
types of model failure. The simulation illustrates the 
dependence of the model-assisted estimators on the model 
assumptions and this is discussed further in the next section. 


Table 2
Mean and Bias of Simulated Variance Estimators, with Cluster
Sampling of 40 Clusters with 5 Elements and Response
Rate of 60 Percent*

                        $\hat V_{TOT}$              $\hat V_{SAM}$              $\hat V_{IMP}$              $\hat V_{MIX}$
Estimate        $\rho$     Mean         Bias         Mean         Bias         Mean         Bias         Mean   Bias
Overall total   0       104          -0.5         50           -1.9         54           -1           0      1.2
Overall total   0.2     [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
Domain total    0       16           [illegible]  [illegible]  [illegible]  [illegible]  0            -4     -0.2
Domain total    0.2     [illegible]  -1           16           [illegible]  [illegible]  0.1          -4     -0.1

* The values in the table are actual values divided by $N^2 \times 10^{-4}$


6. DISCUSSION 


This paper describes a method for estimating the vari- 
ance of a survey estimate when some of the values are im- 
puted using hot deck imputation. The method uses a model- 
assisted approach and conditions on indices for sample 
members, respondents, and hot deck donors. The approach 
extends the work of Deville and Särndal (1994) to variance
estimation for hot deck imputation, probably the most 
widely used method of imputation in household surveys. 
The proposed variance estimator is valid for a general 
sample design and for a variety of estimation procedures
under the superpopulation model and unconfounded as- 
sumptions. The paper also extends the previous work by 
handling stochastic rather than deterministic imputation and 
giving conditions for the bias of the naive variance 
estimator as an estimator of $V_{SAM}$ to be small.

The results focus attention on the need to take the mixed 
component into account when the sample elements have 
unequal weights. In particular, since domain estimates can 
be treated by assigning adjusted weights of zero for sample 
elements not in the domain, the mixed term needs to be 
taken into account in estimating the variance of domain 
imputed estimates even if the original weights were equal. 
Other statistics can also be covered by the approach used
for domain estimates. For example, for the simple regression
of y on x, with y including hot deck imputed values and
x complete, the regression coefficient can be expressed as
a weighted linear combination of the y's: $b =
\sum_i w_i(x_i - \bar x)y_i / \sum_i w_i(x_i - \bar x)^2 = \sum_i w_i' y_i$, where $w_i' =
w_i(x_i - \bar x)/\sum_j w_j(x_j - \bar x)^2$. Also the difference between
two domain estimates, $\hat\theta_{v1}$ and $\hat\theta_{v2}$, can be expressed as
$\hat\theta_{v1} - \hat\theta_{v2} = \sum_i w_i' y_i$, where $w_i' = w_i$
for $i \in v1$, $w_i' = -w_i$ for $i \in v2$, and $w_i' = 0$ for $i \notin v1 \cup v2$.

The last example, involving the difference between 
domain estimates where imputation cells cut across 
domains, highlights the importance of the model in the 
imputation process. In this example, the analytic interest in 
the difference between the domain statistics is incompatible 
with an imputation model that assumes no difference in 
y-distribution across domains within imputation cells. By 
imputing across domains with a hot deck cell imputation 
scheme, the sample domain means for y will be brought 
closer together, thus decreasing the estimate of the differ- 
ence. Thus, a good imputation model is crucial for pro- 
ducing valid point estimates. 

The model-assisted approach to variance estimation with 
imputed data described here assumes a linear estimator, but 
smooth nonlinear functions can also be included using a 
Taylor series approximation. Like the Rao and Shao (1992) 
adjusted jackknife method, the model-assisted method is 
applicable with general sample designs and estimation 
schemes. However, the adjusted jackknife method is 
applicable only with a weighted hot-deck whereas, as a 
result of its model assumptions, the model-assisted method 
can be employed with a variety of hot deck methods, 
including choosing donors with equal probability and with 
probabilities proportional to their weights. The model- 
assisted method of variance estimation could also be 
extended to other imputation schemes such as nearest 
neighbor imputation and fractional hot deck imputation 
(Kalton and Kish 1984; Fay 1996; Kim 2000), a technique 
which reduces the variance due to imputation. 
Implementation of the model-assisted method with hot
deck imputation requires the availability of the information 
needed to compute the three components of the total 
variance. Standard survey sampling variance estimation 
software can be used to compute an estimate of $V_{SAM}$ that is
approximately unbiased with large samples, but as the 
simulation study illustrates the estimate may be biased if the 
cell mean model does not hold. The computations of the 
other components require information on the identity of the 
donor for each imputed value and of the imputation cell 
membership of all sample members. From this information, $d_{ij}$
and $\gamma_{ij}$ can be determined. In addition, an estimate of $\sigma_g^2$ is
required. 

While the theory given above applies to variance 
estimation with many sample designs, including multi-stage 
samples, there are serious concerns about the validity of the 
imputation model in many cases. In the case of multi-stage 
sampling, the means of many survey variables differ across 
PSUs, yet hot deck cells are seldom formed within PSUs. 
Rather they are constructed in terms of other variables that 
cut across PSUs. Even within these cells there may be 
differences in means between PSUs. These differences may 
be offsetting to some extent and not introduce substantial 
biases for point estimation. However, their effect on 
variance estimation may be more significant. As indicated 
in the simulation, failure of the assumptions may have a 
greater impact on second order statistics than first order 
statistics. This issue merits more detailed investigation. 

Imputation is more difficult when the goal is estimating 
a function of more than one variable with missing values. 
To produce an unbiased estimate of a parameter that 
involves several variables subject to imputation requires the 
development of an appropriate multivariate model and an 
imputation procedure consistent with that model. Given an 
appropriate model and a hot deck imputation that is 
consistent with it, the model-assisted approach to variance 
estimation can then be implemented. However, estimating 
the variance becomes considerably more complex with 
multivariate estimates. The development of practical 
methods of imputation and variance estimation for this 
situation is much needed. 
Domain Estimation Using Linear Regression 


MICHAEL A. HIDIROGLOU and ZDENEK PATAK 


ABSTRACT 


One of the main objectives of a sample survey is the computation of estimates of means and totals for specific domains of 
interest. Domains are determined either before the survey is carried out (primary domains) or after it has been carried out 
(secondary domains). The reliability of the associated estimates depends on the variability of the sample size as well as on 
the y-variables of interest. This variability cannot be controlled in the absence of auxiliary information for subgroups of the 
population. However, if auxiliary information is available, the estimated reliability of the resulting estimates can be 
controlled to some extent. In this paper, we study the potential improvements in terms of the reliability of domain estimates 
that use auxiliary information. The properties (bias, coverage, efficiency) of various estimators that use auxiliary 
information are compared using a conditional approach. 


KEY WORDS : Domain estimation; Auxiliary data; Conditional properties. 


1. INTRODUCTION 


One of the main objectives of a sample survey is to 
compute estimates of means and totals of a number of 
characteristics associated with the units of a finite 
population U. The data are often used for analytic studies 
such as the comparison of means and totals for subgroups of 
the population. Such subgroups are referred to as domains of 
study. Hartley’s (1959) paper is one of the first attempts to 
unify the theory of domain estimation. Hartley provided the 
theory for a number of sample designs where domain 
estimation was of interest. His paper mostly discussed 
estimators that did not make use of auxiliary information. 
He did, however, consider the case of the ratio estimator 
where population totals were known for the domains. The 
use of auxiliary data in the context of domain estimation has 
been discussed in a number of articles. Särndal, Swensson 
and Wretman (1992) provided a unified treatment of 
domain estimation with auxiliary data. Estevao, Hidiroglou 
and Särndal (1995) were the first to recognize that the 
weights accounting for auxiliary data could be domain 
dependent or not domain dependent. Estevao and Särndal 
(1999) discussed desirable properties of regression esti- 
mators of domain totals using auxiliary data. 

The existence of multivariate auxiliary data raises a 
number of questions in the context of domain estimation. 
Some of those questions are as follows. What is the effect of 
having auxiliary information that is not known on a popu- 
lation basis for the given domain of interest? How do we 
compute valid variance estimates in the context of domain 
estimators that use auxiliary data? If more than one 
estimator is possible for point estimation and/or variance 
estimation, what criteria should be used to choose the best 
estimator? Durbin (1969) supported the use of conditional 
inference to do such comparisons. He stated, “If the sample 
size is determined by a random mechanism and one happens 
to get a large sample, one knows perfectly well that the 
quantities of interest are measured more accurately than 
they would have been if the sample size had happened to be 
small. It seems self evident that one should use the infor- 
mation available on sample size in the interpretation of the 
result. To average over variations in sample size which 
might have occurred but did not occur, when in fact the 
sample size is exactly known, seems quite wrong from the 
standpoint of the analysis of the data actually observed”. 
Holt and Smith (1979) favored conditional inference, and 
applied it to study the properties of the post-stratified esti- 
mator, given simple random sampling. Rao (1985) intro- 
duced the idea of “recognizable subsets” of the population 
to formalize the conditioning process. Recognizable subsets 
are defined after the sample has been drawn. In the case of 
domain estimation the number of units belonging to a par- 
ticular domain is a random variable. Recognizable subsets in 
that context are those where the sample size is fixed within 
each domain. Comparison of the conditional statistical prop- 
erties (i.e., bias, mean squared error) of the different esti- 
mators can then be based on these subsets. The conditioning 
process assumes that population totals are known for each 
domain. In the case of simple random sampling, the number 
of units in the population domain is assumed known. 

The main purpose of this paper is to study the un- 
conditional and conditional properties of a number of 
domain estimators of totals in the presence of auxiliary data 
in the context of simple random sampling without 
replacement (SRSWOR). These conditional properties will 
be established by conditioning on fixed sample sizes within 
each domain. 
The paper is organized as follows. In section 2 we will 
introduce several estimators of domain totals. Their 
unconditional and conditional properties are provided in 
section 3. In section 4, we will present the results of a 
simulation study for the case of the ratio estimator of 
domain totals, and provide some concluding remarks in 
section 5. 


2. ESTIMATORS OF DOMAIN TOTALS 


We first introduce some notation to set up the framework 
under which we will be assessing the properties of various 
estimators of domain totals. Let $U = \{1, \ldots, k, \ldots, N\}$ 
denote the finite population. A sample $s$ is drawn from this 
population using a sampling plan $P(s)$. Let the first and 
second order inclusion probabilities be given by $\pi_k$ and 
$\pi_{kl}$. The domain total $Y_d = \sum_{U_d} y_k$ is the parameter 
of interest for a variable "y". A domain $U_d$ $(d = 1, \ldots, D)$ 
is any subpopulation of $U$ for which a separate estimate may 
be required, before or after the planning stage. The number 
of population units in domain $U_d$ is denoted $N_d$, and 
$N = \sum_{d=1}^{D} N_d$ for $D$ mutually exclusive and exhaustive 
domains spanning the entire population. The sample $s$ is 
correspondingly divided into $D$ domains $s_1, \ldots, s_d, \ldots, s_D$, 
where $s_d = s \cap U_d$. The realized sample size within $s_d$ 
is a random variable that we denote $n_d$. Note that the sum 
of the $n_d$'s over non-overlapping and exhaustive domains of 
the sample equals $n$. An estimator of the domain total 
$Y_d = \sum_{U_d} y_k$ that does not use auxiliary data is given by 
$\hat{Y}_{d,\mathrm{HT}} = \sum_{s_d} w_k y_k = \sum_{s} w_k y_{dk}$, where 
$w_k = \pi_k^{-1}$, and $y_{dk}$ is equal to $y_k$ if $k \in U_d$ and 0 
otherwise. 
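
As a minimal sketch of the estimator $\hat{Y}_{d,\mathrm{HT}}$ just defined, the following Python fragment assumes simple random sampling without replacement, so that $w_k = N/n$. The population, the domain labels and the function name `ht_domain_total` are hypothetical and serve only as an illustration.

```python
import random

def ht_domain_total(sample, pop_size, domain):
    """Horvitz-Thompson domain total under SRSWOR (assumed design, w_k = N/n).

    sample   : list of (y, label) pairs for the sampled units
    pop_size : population size N
    domain   : label of the domain U_d of interest
    """
    n = len(sample)
    w = pop_size / n  # design weight w_k = 1 / pi_k = N / n
    # y_dk equals y_k inside the domain and 0 outside it
    return sum(w * y for y, label in sample if label == domain)

# Hypothetical population of 100 units split into two domains.
population = [(float(k), "small" if k % 10 == 0 else "large")
              for k in range(1, 101)]

random.seed(2004)
sample = random.sample(population, 20)

est_small = ht_domain_total(sample, len(population), "small")
est_large = ht_domain_total(sample, len(population), "large")
est_all = (len(population) / len(sample)) * sum(y for y, _ in sample)
```

Because the two domains are exclusive and exhaustive, the two domain estimates add up to the Horvitz-Thompson estimate of the overall total.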

Auxiliary information in the form of a p-dimensional 
vector $\mathbf{x}$ may be available at different levels of aggregation. 
It may be known for each unit in the population, or for 
subsets $U_g \subseteq U$ $(g = 1, \ldots, G)$ of the population $U$ that may 
coincide with the domains $U_d$. We denote such known 
totals $\mathbf{X}_g = \sum_{U_g} \mathbf{x}_k$; they are estimated by 
$\hat{\mathbf{X}}_{g,\mathrm{HT}} = \sum_{s_g} w_k \mathbf{x}_k$. A modified set of weights 
$\tilde{w}_k$ incorporating the auxiliary data can be computed using 
either calibration or linear regression procedures (LR). We 
chose the LR approach. In the case of $G$ population groups, 
the LR estimator is given by 

$$\hat{Y}_{d,\mathrm{LR}} = \hat{Y}_{d,\mathrm{HT}} + \sum_{g=1}^{G} (\mathbf{X}_g - \hat{\mathbf{X}}_{g,\mathrm{HT}})' \hat{\mathbf{B}}_g$$

where $\hat{\mathbf{B}}_g = (\sum_{s_g} w_k \mathbf{x}_k \mathbf{x}_k' / c_k)^{-1} \sum_{s_g} w_k \mathbf{x}_k y_{dk} / c_k$, and 
the $c_k$ are positive constants. The use of auxiliary data in 
the domain context offers a wide range of choices for 
various levels at which auxiliary totals are used and 
regression models are constructed. To simplify matters, we 
assume that $G = 1$ (i.e., a single group $U$), yielding the 
simple regression estimator 
$\hat{Y}_{d,\mathrm{LR}} = \hat{Y}_{d,\mathrm{HT}} + (\mathbf{X} - \hat{\mathbf{X}}_{\mathrm{HT}})' \hat{\mathbf{B}}$, 
where $\hat{\mathbf{X}}_{\mathrm{HT}} = \sum_{s} w_k \mathbf{x}_k$. 

We consider six estimators for estimating the domain 
population total $Y_d$. These estimators are based on whether 
we use the domain totals $\mathbf{X}_d$ or the population total $\mathbf{X}$, and 
whether we construct the regression estimator at the domain 
or at the population levels. The estimators are categorized 
into Horvitz-Thompson and "Hajek" types. We provide an 
example of the ratio estimator that is associated with each of 
these estimators. 


2.1 Horvitz-Thompson Type Estimators 


Case 1 


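We assume that the auxiliary information $\mathbf{x}_k$ is available 
at the population level $U$, $\mathbf{X} = \sum_{U} \mathbf{x}_k$, and that the domain 
specific $y_{dk}$ variables are regressed on $\mathbf{x}_k$, $k \in U$. The 
resulting population regression parameter 
$\mathbf{B}_{1d} = (\sum_{U} \mathbf{x}_k \mathbf{x}_k' / c_k)^{-1} \sum_{U} \mathbf{x}_k y_{dk} / c_k$ 
is estimated by 
$\hat{\mathbf{B}}_{1d} = (\sum_{s} w_k \mathbf{x}_k \mathbf{x}_k' / c_k)^{-1} \sum_{s} w_k \mathbf{x}_k y_{dk} / c_k$, 
and the resulting estimator of the population total $Y_d$ is 

$$\hat{Y}_{d,\mathrm{lr}_1} = \hat{Y}_{d,\mathrm{HT}} + (\mathbf{X} - \hat{\mathbf{X}}_{\mathrm{HT}})' \hat{\mathbf{B}}_{1d}. \tag{2.1}$$

Example: The domain ratio estimator given by 
$\hat{Y}_{d,\mathrm{RAT}} = X \hat{R}_{1d}$, where $\hat{R}_{1d} = \hat{Y}_{d,\mathrm{HT}} / \hat{X}_{\mathrm{HT}}$. This 
estimator was first suggested by Hidiroglou (1991), and is 
discussed in more detail in Estevao et al. (1995). 

If the auxiliary data totals are available at the domain 
level, $\mathbf{X}_d = \sum_{U_d} \mathbf{x}_k$, then two possible estimators of $Y_d$ 
(cases 2 and 3) can be constructed, depending on how the 
population regression parameter is estimated. 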


Case 2 


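The population regression parameter 

$$\mathbf{B}_{2d} = \Big(\sum_{U_d} \mathbf{x}_k \mathbf{x}_k' / c_k\Big)^{-1} \sum_{U_d} \mathbf{x}_k y_k / c_k$$

is estimated by regressing $y_k$ on $\mathbf{x}_k$ for each domain $U_d$ 
separately. Its estimator is given by 

$$\hat{\mathbf{B}}_{2d} = \Big(\sum_{s_d} w_k \mathbf{x}_k \mathbf{x}_k' / c_k\Big)^{-1} \sum_{s_d} w_k \mathbf{x}_k y_k / c_k,$$

and the resulting regression estimator of a domain total is 

$$\hat{Y}_{d,\mathrm{lr}_2} = \hat{Y}_{d,\mathrm{HT}} + (\mathbf{X}_d - \hat{\mathbf{X}}_{d,\mathrm{HT}})' \hat{\mathbf{B}}_{2d}, \tag{2.2}$$

where $\hat{\mathbf{X}}_{d,\mathrm{HT}} = \sum_{s} w_k \mathbf{x}_{dk}$, with $\mathbf{x}_{dk}$ defined similarly to $y_{dk}$. 


Example: The Horvitz-Thompson post-stratified estimator 
given by $\hat{Y}_{d,\mathrm{POSTR}} = X_d \hat{R}_{2d}$, where $\hat{R}_{2d} = \hat{Y}_{d,\mathrm{HT}} / \hat{X}_{d,\mathrm{HT}}$. 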
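Case 3 


The population regression parameter 

$$\mathbf{B}_{3} = \Big(\sum_{U} \mathbf{x}_k \mathbf{x}_k' / c_k\Big)^{-1} \sum_{U} \mathbf{x}_k y_k / c_k$$

is estimated by regressing $y_k$ on $\mathbf{x}_k$ using all units in $U$. 
The corresponding estimator is 

$$\hat{\mathbf{B}}_{3} = \Big(\sum_{s} w_k \mathbf{x}_k \mathbf{x}_k' / c_k\Big)^{-1} \sum_{s} w_k \mathbf{x}_k y_k / c_k,$$

resulting in the regression estimator 

$$\hat{Y}_{d,\mathrm{lr}_3} = \hat{Y}_{d,\mathrm{HT}} + (\mathbf{X}_d - \hat{\mathbf{X}}_{d,\mathrm{HT}})' \hat{\mathbf{B}}_{3}. \tag{2.3}$$

Example: The alternate ratio estimator given by 
$\hat{Y}_{d,\mathrm{ALTR}} = \hat{Y}_{d,\mathrm{HT}} + (X_d - \hat{X}_{d,\mathrm{HT}}) \hat{R}_{3}$, where $\hat{R}_{3} = \hat{Y}_{\mathrm{HT}} / \hat{X}_{\mathrm{HT}}$. 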


2.2 Hajek Type Estimators 


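Estimators (2.1)-(2.3) belong to the Horvitz-Thompson 
family. If the known population domain size $N_d$ is also 
incorporated in the estimation, then we get the "Hajek" 
versions of the previously defined Horvitz-Thompson 
regression estimators. The Hajek regression estimators are 
obtained by replacing $\hat{Y}_{d,\mathrm{HT}}$, $\hat{\mathbf{X}}_{d,\mathrm{HT}}$, and $\hat{\mathbf{X}}_{\mathrm{HT}}$ by 

$$\hat{Y}_{d,\mathrm{HA}} = (N_d / \hat{N}_d)\, \hat{Y}_{d,\mathrm{HT}}, \quad \hat{\mathbf{X}}_{d,\mathrm{HA}} = (N_d / \hat{N}_d)\, \hat{\mathbf{X}}_{d,\mathrm{HT}}, \quad \text{and} \quad \hat{\mathbf{X}}_{\mathrm{HA}} = (N / \hat{N})\, \hat{\mathbf{X}}_{\mathrm{HT}},$$

where $\hat{N}_d = \sum_{s_d} w_k$ and $\hat{N} = \sum_{s} w_k$. The estimators are 
nearly conditionally unbiased for a given $n_d$, whereas their 
Horvitz-Thompson counterparts do not have this property. 
The "$\hat{\mathbf{B}}$"s contained within the Hajek regression estimators 
correspond exactly to their Horvitz-Thompson counterparts. 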
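
Case 4 

$$\tilde{Y}_{d,\mathrm{lr}_1} = \hat{Y}_{d,\mathrm{HA}} + (\mathbf{X} - \hat{\mathbf{X}}_{\mathrm{HA}})' \hat{\mathbf{B}}_{1d}. \tag{2.4}$$

Example: The Hajek ratio estimator given by 
$\tilde{Y}_{d,\mathrm{RAT}} = \hat{Y}_{d,\mathrm{HA}} + (X - \hat{X}_{\mathrm{HA}}) \hat{R}_{1d}$. 

Case 5 

$$\tilde{Y}_{d,\mathrm{lr}_2} = \hat{Y}_{d,\mathrm{HA}} + (\mathbf{X}_d - \hat{\mathbf{X}}_{d,\mathrm{HA}})' \hat{\mathbf{B}}_{2d}. \tag{2.5}$$

Example: The Hajek post-stratified ratio estimator given by 
$\tilde{Y}_{d,\mathrm{POSTR}} = \hat{Y}_{d,\mathrm{HA}} + (X_d - \hat{X}_{d,\mathrm{HA}}) \hat{R}_{2d}$. This estimator is 
identical to the Horvitz-Thompson post-stratified estimator. 

Case 6 

$$\tilde{Y}_{d,\mathrm{lr}_3} = \hat{Y}_{d,\mathrm{HA}} + (\mathbf{X}_d - \hat{\mathbf{X}}_{d,\mathrm{HA}})' \hat{\mathbf{B}}_{3}. \tag{2.6}$$

Example: The Hajek alternate ratio estimator given by 
$\tilde{Y}_{d,\mathrm{ALTR}} = \hat{Y}_{d,\mathrm{HA}} + (X_d - \hat{X}_{d,\mathrm{HA}}) \hat{R}_{3}$. 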
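
The ratio examples attached to cases 1 to 6 can be sketched in a few lines of Python. The fragment below is only an illustration under stated assumptions: the design is taken to be SRSWOR ($w_k = N/n$), there is a single auxiliary variable, and the function name `ratio_domain_estimators`, the dictionary keys and the toy population are hypothetical.

```python
def ratio_domain_estimators(sample, N, N_d, X, X_d):
    """Ratio versions of estimators (2.1)-(2.6), assuming SRSWOR weights.

    sample : list of (y, x, in_domain) triples for the sampled units
    N, N_d : known population and domain sizes
    X, X_d : known population and domain totals of x
    """
    n = len(sample)
    w = N / n
    y_ht = sum(w * y for y, x, in_d in sample)
    x_ht = sum(w * x for y, x, in_d in sample)
    yd_ht = sum(w * y for y, x, in_d in sample if in_d)
    xd_ht = sum(w * x for y, x, in_d in sample if in_d)
    nd_hat = w * sum(1 for y, x, in_d in sample if in_d)  # N-hat_d
    n_hat = w * n                                         # N-hat (= N here)

    # Hajek-adjusted totals
    yd_ha = N_d / nd_hat * yd_ht
    xd_ha = N_d / nd_hat * xd_ht
    x_ha = N / n_hat * x_ht

    # The ratios are the same for the HT and the Hajek versions
    r1 = yd_ht / x_ht
    r2 = yd_ht / xd_ht
    r3 = y_ht / x_ht

    return {
        "HT_RAT": X * r1,
        "HT_POSTR": X_d * r2,
        "HT_ALTR": yd_ht + (X_d - xd_ht) * r3,
        "HA_RAT": yd_ha + (X - x_ha) * r1,
        "HA_POSTR": yd_ha + (X_d - xd_ha) * r2,
        "HA_ALTR": yd_ha + (X_d - xd_ha) * r3,
    }

# Hypothetical population: every fifth unit belongs to the domain.
units = []
for k in range(100):
    x = 10.0 + (k % 7)
    in_d = (k % 5 == 0)
    y = (3.0 if in_d else 1.0) * x + (k % 3)
    units.append((y, x, in_d))

N = len(units)
N_d = sum(1 for y, x, in_d in units if in_d)
X = sum(x for y, x, in_d in units)
X_d = sum(x for y, x, in_d in units if in_d)
Y_d = sum(y for y, x, in_d in units if in_d)

sample = units[::4]  # any fixed subset of 25 units, used only to run the formulas
estimates = ratio_domain_estimators(sample, N, N_d, X, X_d)
census = ratio_domain_estimators(units, N, N_d, X, X_d)
```

Running the function on the full population returns $Y_d$ for every estimator, and the Hajek post-stratified ratio estimate coincides with its Horvitz-Thompson counterpart, as stated under case 5.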


3. PROPERTIES OF THE DOMAIN ESTIMATORS 


Estimators (2.1)-(2.6) may be expressed as: 

$$\hat{Y}_{d,\mathrm{lr}} = \sum_{s} w_k a_{dk} y_{dk} = \sum_{s} \tilde{w}_{dk} y_{dk} \tag{2.7}$$

where $a_{dk}$ is an adjustment factor that may or may not be 
domain dependent. The product of the design weight $w_k$ 
and the adjustment factor $a_{dk}$ is known as the regression 
weight (or calibration weight) $\tilde{w}_{dk}$. Tables 1 and 2 provide 
a summary of these factors, as well as the residuals required 
for unconditional variance estimation. The population and 
sample residuals are denoted as $E_{dk}$ and $e_{dk}$. The indicator 
variable $\delta_{dk}$ is equal to one if $k \in U_d$ and zero otherwise. 

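The approximate population and corresponding estimated 
variances of the Horvitz-Thompson estimators 
$\hat{Y}_{d,\mathrm{lr}_j}$ $(j = 1, 2, 3)$ are 

$$V(\hat{Y}_{d,\mathrm{lr}_j}) = \sum\sum_{U} \Delta_{kl} \frac{E_{dk}}{\pi_k} \frac{E_{dl}}{\pi_l} \tag{2.8}$$

and 

$$v(\hat{Y}_{d,\mathrm{lr}_j}) = \sum\sum_{s} \frac{\Delta_{kl}}{\pi_{kl}} \frac{a_{dk} e_{dk}}{\pi_k} \frac{a_{dl} e_{dl}}{\pi_l} \tag{2.9}$$

where $\Delta_{kl} = \pi_{kl} - \pi_k \pi_l$; $\pi_{kl} = \Pr\{k, l \in s\}$, with the 
appropriate $E_{dk}$'s, $e_{dk}$'s, and $a_{dk}$'s defined in Table 1. 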

The approximate unconditional population and corresponding 
estimated variances of the Hajek-type estimators 
$\tilde{Y}_{d,\mathrm{lr}_j}$ $(j = 1, 2, 3)$ are 


Ds A, = ayy Ex IN 4 ne }. 
Ty 


En = we Ex IN 4 8x 
Ld 


for j=1 


cs co 


for j=2,3 (2.10) 




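Table 1 
Adjustment Factors and Residuals for Horvitz-Thompson Regression Estimators 

Estimator | Domain Dependent | Adjustment Factor $a_{dk}$ | Residuals 
$\hat{Y}_{d,\mathrm{lr}_1}$ | No | $1 + (\mathbf{X} - \hat{\mathbf{X}}_{\mathrm{HT}})' (\sum_{s} w_k \mathbf{x}_k \mathbf{x}_k' / c_k)^{-1} \mathbf{x}_k / c_k$ | $E_{dk} = y_{dk} - \mathbf{x}_k' \mathbf{B}_{1d}$; $e_{dk} = y_{dk} - \mathbf{x}_k' \hat{\mathbf{B}}_{1d}$ 
$\hat{Y}_{d,\mathrm{lr}_2}$ | Yes | $\delta_{dk} \{ 1 + (\mathbf{X}_d - \hat{\mathbf{X}}_{d,\mathrm{HT}})' (\sum_{s_d} w_k \mathbf{x}_k \mathbf{x}_k' / c_k)^{-1} \mathbf{x}_k / c_k \}$ | $E_{dk} = y_{dk} - \mathbf{x}_{dk}' \mathbf{B}_{2d}$; $e_{dk} = y_{dk} - \mathbf{x}_{dk}' \hat{\mathbf{B}}_{2d}$ 
$\hat{Y}_{d,\mathrm{lr}_3}$ | Yes | $\delta_{dk} + (\mathbf{X}_d - \hat{\mathbf{X}}_{d,\mathrm{HT}})' (\sum_{s} w_k \mathbf{x}_k \mathbf{x}_k' / c_k)^{-1} \mathbf{x}_k / c_k$ | $E_{dk} = y_{dk} - \mathbf{x}_{dk}' \mathbf{B}_{3}$; $e_{dk} = y_{dk} - \mathbf{x}_{dk}' \hat{\mathbf{B}}_{3}$ 
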
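Table 2 
Adjustment Factors and Residuals for the Hajek-type Estimators 

Estimator | Domain Dependent | Adjustment Factor $a_{dk}$ | Residuals 
$\tilde{Y}_{d,\mathrm{lr}_1}$ | Yes | $(N_d / \hat{N}_d) + (\mathbf{X} - \hat{\mathbf{X}}_{\mathrm{HA}})' (\sum_{s} w_k \mathbf{x}_k \mathbf{x}_k' / c_k)^{-1} \mathbf{x}_k / c_k$ | $E_{dk} = y_{dk} - \mathbf{x}_k' \mathbf{B}_{1d}$; $e_{dk} = y_{dk} - \mathbf{x}_k' \hat{\mathbf{B}}_{1d}$ 
$\tilde{Y}_{d,\mathrm{lr}_2}$ | Yes | $\delta_{dk} \{ (N_d / \hat{N}_d) + (\mathbf{X}_d - \hat{\mathbf{X}}_{d,\mathrm{HA}})' (\sum_{s_d} w_k \mathbf{x}_k \mathbf{x}_k' / c_k)^{-1} \mathbf{x}_k / c_k \}$ | $E_{dk} = y_{dk} - \mathbf{x}_{dk}' \mathbf{B}_{2d}$; $e_{dk} = y_{dk} - \mathbf{x}_{dk}' \hat{\mathbf{B}}_{2d}$ 
$\tilde{Y}_{d,\mathrm{lr}_3}$ | Yes | $\delta_{dk} (N_d / \hat{N}_d) + (\mathbf{X}_d - \hat{\mathbf{X}}_{d,\mathrm{HA}})' (\sum_{s} w_k \mathbf{x}_k \mathbf{x}_k' / c_k)^{-1} \mathbf{x}_k / c_k$ | $E_{dk} = y_{dk} - \mathbf{x}_{dk}' \mathbf{B}_{3}$; $e_{dk} = y_{dk} - \mathbf{x}_{dk}' \hat{\mathbf{B}}_{3}$ 
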
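and 

$$v(\tilde{Y}_{d,\mathrm{lr}_j}) = \sum\sum_{s} \frac{\Delta_{kl}}{\pi_{kl}} \frac{a_{dk} e_{dk}}{\pi_k} \frac{a_{dl} e_{dl}}{\pi_l} \quad \text{for } j = 1, 2, 3 \tag{2.11}$$

where $\bar{E}_d = \sum_{U_d} E_{dk} / N_d$. The appropriate $E_{dk}$'s, $e_{dk}$'s, 
and $a_{dk}$'s are defined in Table 2. Note that the form of the 
estimated unconditional variance is the same for both the 
Horvitz-Thompson and the Hajek-type estimators. 


Result 3.1: The Hajek-type regression estimator can be 
obtained as a by-product of the regression of $y_k$ on 
$(1, (\mathbf{x}_k - \bar{\mathbf{x}}_U)')'$, where $\bar{\mathbf{x}}_U = N^{-1} \sum_{U} \mathbf{x}_k$. The resulting 
regression vector is $\hat{\mathbf{B}}^* = (\hat{B}_0^*, \hat{\mathbf{B}}_1')'$, where 

$$\hat{\mathbf{B}}_1 = \Big( \sum_{s} w_k (\mathbf{x}_k - \tilde{\mathbf{x}}_s)(\mathbf{x}_k - \tilde{\mathbf{x}}_s)' / c_k \Big)^{-1} \sum_{s} w_k (\mathbf{x}_k - \tilde{\mathbf{x}}_s) y_k / c_k$$

and $\hat{B}_0^* = \tilde{y}_s + (\bar{\mathbf{x}}_U - \tilde{\mathbf{x}}_s)' \hat{\mathbf{B}}_1$, with $\tilde{y}_s = \sum_{s} w_k y_k / \hat{N}$ and 
$\tilde{\mathbf{x}}_s = \hat{\mathbf{X}}_{\mathrm{HT}} / \hat{N}$. The regression estimator of total 
$\hat{Y}_{\mathrm{lr}} = N \hat{B}_0^*$ is equal to the Hajek form 
$\tilde{Y}_{\mathrm{lr}} = \hat{Y}_{\mathrm{HA}} + (\mathbf{X} - \hat{\mathbf{X}}_{\mathrm{HA}})' \hat{\mathbf{B}}_1$. The various 
Hajek-type domain regression estimators can be obtained 
using this approach. For instance, regressing $y_{dk}$ on 
$(1, (\mathbf{x}_k - \bar{\mathbf{x}}_U)')'$ yields $\tilde{Y}_{d,\mathrm{lr}_1}$. 


Proof: We first show how to arrive at the Hajek form of the 
regression estimator. Defining the auxiliary data vector $\mathbf{z}_k$ 
as $\mathbf{z}_k = (x_{0k}, \mathbf{x}_k')'$, the regression estimator is 

$$\hat{Y}_{\mathrm{lr}} = \hat{Y}_{\mathrm{HT}} + (\mathbf{Z} - \hat{\mathbf{Z}}_{\mathrm{HT}})' \hat{\mathbf{B}}$$

where 

$$\hat{\mathbf{B}} = \Big( \sum_{s} w_k \mathbf{z}_k \mathbf{z}_k' / c_k \Big)^{-1} \Big( \sum_{s} w_k \mathbf{z}_k y_k / c_k \Big),$$

$\mathbf{Z} = \sum_{U} \mathbf{z}_k$ and $\hat{\mathbf{Z}}_{\mathrm{HT}} = \sum_{s} w_k \mathbf{z}_k$. 
If $x_{0k} = 1$, $\hat{Y}_{\mathrm{lr}}$ is exactly equivalent to $\hat{Y} = \mathbf{Z}' \hat{\mathbf{B}}$. 
Decomposing $\hat{\mathbf{B}}$ as $\hat{\mathbf{B}} = (\hat{B}_0, \hat{\mathbf{B}}_1')'$, we have that 
$\hat{Y}_{\mathrm{lr}} = N \hat{B}_0 + \sum_{U} \mathbf{x}_k' \hat{\mathbf{B}}_1$, where 
$\hat{B}_0 = \tilde{y}_s - \tilde{\mathbf{x}}_s' \hat{\mathbf{B}}_1$ and 

$$\hat{\mathbf{B}}_1 = \Big( \sum_{s} w_k (\mathbf{x}_k - \tilde{\mathbf{x}}_s)(\mathbf{x}_k - \tilde{\mathbf{x}}_s)' / c_k \Big)^{-1} \sum_{s} w_k (\mathbf{x}_k - \tilde{\mathbf{x}}_s) y_k / c_k.$$

Hence, the Hajek form of the regression estimator is 

$$\tilde{Y}_{\mathrm{lr}} = \hat{Y}_{\mathrm{HA}} + (\mathbf{X} - \hat{\mathbf{X}}_{\mathrm{HA}})' \hat{\mathbf{B}}_1.$$

Regressing $y_k$ on $(1, (\mathbf{x}_k - \bar{\mathbf{x}}_U)')'$ yields the estimated 
regression vector $\hat{\mathbf{B}}^* = (\hat{B}_0^*, \hat{\mathbf{B}}_1')'$, where $\hat{\mathbf{B}}_1$ is as above 
and $\hat{B}_0^* = \tilde{y}_s + (\bar{\mathbf{x}}_U - \tilde{\mathbf{x}}_s)' \hat{\mathbf{B}}_1$. Substituting $\hat{B}_0^*$ into 
$\hat{Y}_{\mathrm{lr}} = N \hat{B}_0^*$ yields the Hajek form $\tilde{Y}_{\mathrm{lr}}$. 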
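
Result 3.1 can be checked numerically. The following Python sketch fits a weighted regression of $y_k$ on $(1, x_k - \bar{x}_U)$ by solving the two normal equations directly, then compares $N \hat{B}_0^*$ with the Hajek form. It assumes a single auxiliary variable and $c_k = 1$; the data values, the weights and the helper name `weighted_fit` are hypothetical.

```python
def weighted_fit(u, y, w):
    """Weighted least squares of y on (1, u): returns (intercept, slope).

    Solves the 2x2 normal equations explicitly; assumes c_k = 1.
    """
    s0 = sum(w)
    s1 = sum(wi * ui for wi, ui in zip(w, u))
    s2 = sum(wi * ui * ui for wi, ui in zip(w, u))
    t0 = sum(wi * yi for wi, yi in zip(w, y))
    t1 = sum(wi * ui * yi for wi, ui, yi in zip(w, u, y))
    det = s0 * s2 - s1 * s1
    return (s2 * t0 - s1 * t1) / det, (s0 * t1 - s1 * t0) / det

# Hypothetical population values of x (known for all units).
x_pop = [3.0, 5.0, 8.0, 13.0, 21.0, 34.0, 55.0, 89.0]
N = len(x_pop)
X = sum(x_pop)
x_bar_U = X / N

# Hypothetical sample with unequal design weights w_k.
xs = [5.0, 13.0, 34.0, 89.0]
ys = [11.0, 30.0, 70.0, 170.0]
ws = [1.5, 2.0, 2.5, 3.0]

# Regression of y on (1, x - x_bar_U)
us = [x - x_bar_U for x in xs]
b0_star, b1 = weighted_fit(us, ys, ws)
regression_total = N * b0_star

# Hajek form: Y_HA + (X - X_HA) * B_1
n_hat = sum(ws)
y_ha = N / n_hat * sum(w * y for w, y in zip(ws, ys))
x_ha = N / n_hat * sum(w * x for w, x in zip(ws, xs))
hajek_form = y_ha + (X - x_ha) * b1
```

The two quantities `regression_total` and `hajek_form` agree up to rounding error, which is the equality claimed in the result.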


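Remark 3.1: (Additivity). Suppose that the domains $U_d$ 
are mutually exclusive ($U_{d_1} \cap U_{d_2} = \emptyset$ for $d_1 \neq d_2$) and 
exhaustive ($\cup_{d=1}^{D} U_d = U$). Additivity over domains 
means that $\sum_{d=1}^{D} \hat{Y}_{d,\mathrm{lr}} = \hat{Y}_{\mathrm{lr}}$, where 

$$\hat{Y}_{\mathrm{lr}} = \hat{Y}_{\mathrm{HT}} + (\mathbf{X} - \hat{\mathbf{X}}_{\mathrm{HT}})' \hat{\mathbf{B}}.$$

The additive property of $\hat{Y}_{d,\mathrm{lr}}$ is desirable because a single 
set of calibration weights, $w_k a_{dk}$, can be used repeatedly 
to produce ad hoc domain estimates. Only two out of the six 
estimators, $\hat{Y}_{d,\mathrm{lr}_1}$ and $\hat{Y}_{d,\mathrm{lr}_3}$, are additive over all such 
domains. 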
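
Additivity can be illustrated with the ratio examples of section 2. The sketch below, which assumes SRSWOR weights and a single auxiliary variable, sums the HT ratio, HT post-stratified ratio and HT alternate ratio estimates over two exhaustive domains and compares each sum with the population ratio estimate $X \hat{Y}_{\mathrm{HT}} / \hat{X}_{\mathrm{HT}}$. The population and the function name are hypothetical.

```python
def domain_ratio_estimates(sample, N, X, X_dom):
    """HT-type ratio estimates for each domain, assuming SRSWOR weights.

    sample : list of (y, x, domain_label) triples
    X_dom  : dict mapping each domain label to its known total X_d
    Returns (per-domain estimates, population ratio estimate).
    """
    w = N / len(sample)
    y_ht = sum(w * y for y, x, d in sample)
    x_ht = sum(w * x for y, x, d in sample)
    per_domain = {}
    for dom, X_d in X_dom.items():
        yd = sum(w * y for y, x, d in sample if d == dom)
        xd = sum(w * x for y, x, d in sample if d == dom)
        per_domain[dom] = {
            "RAT": X * yd / x_ht,                       # ratio version of (2.1)
            "POSTR": X_d * yd / xd,                     # ratio version of (2.2)
            "ALTR": yd + (X_d - xd) * y_ht / x_ht,      # ratio version of (2.3)
        }
    return per_domain, X * y_ht / x_ht

# Hypothetical population with two exclusive and exhaustive domains.
units = []
for k in range(100):
    x = 10.0 + (k % 7)
    dom = "a" if k % 5 == 0 else "b"
    y = (3.0 if dom == "a" else 1.0) * x + (k % 3)
    units.append((y, x, dom))

N = len(units)
X = sum(x for y, x, d in units)
X_dom = {lab: sum(x for y, x, d in units if d == lab) for lab in ("a", "b")}

sample = units[::4]  # fixed subset of 25 units, used only to run the formulas
per_domain, overall = domain_ratio_estimates(sample, N, X, X_dom)

sum_rat = sum(est["RAT"] for est in per_domain.values())
sum_altr = sum(est["ALTR"] for est in per_domain.values())
sum_postr = sum(est["POSTR"] for est in per_domain.values())
```

The sums `sum_rat` and `sum_altr` reproduce `overall`, whereas `sum_postr` is a post-stratified estimate of the total and need not equal it.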


Remark 3.2: (Calibrating on domain auxiliary data). 
Estevao et al. (1999) discussed some of the estimators 
provided in Tables 1 and 2 for the case of a single auxiliary 
variable $x_k$. They arrived at their estimators by controlling 
on domain information, either via auxiliary variables and/or 
control totals. 

In what follows, we will assume that the sample $s$ of size 
$n$ has been selected using simple random sampling without 
replacement (SRSWOR) from a universe of size $N$. The 
estimated unconditional variance of the Horvitz-Thompson 
and Hajek-type estimators for this sampling plan is: 

$$v(\hat{Y}_{d,\mathrm{lr}_j}) = v(\tilde{Y}_{d,\mathrm{lr}_j}) = \frac{N^2 (1 - f)}{n} \sum_{s} \frac{(a_{dk} e_{dk} - \overline{ae}_d)^2}{n - 1} \tag{2.12}$$

where $\overline{ae}_d = \sum_{s} a_{dk} e_{dk} / n$ and $f = n/N$ is the sampling 
fraction. 

3.1 Unconditional Properties 


The choice between the various regression estimators 
should be based on the level at which the auxiliary totals are 
available, as well as bias and variance. All the above 
estimators are asymptotically unconditionally unbiased; 
however, their variances differ. We compare the unconditional 
population variances of the six domain regression estimators 
(2.1)-(2.6) by distinguishing two cases: (i) an intercept 
term is included in the regression; and (ii) no intercept term 
is included in the regression. 


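Result 3.2: Assume that an intercept is included in the 
regression, $c_k = c$ for all $k \in U$, and $N > p$, where $p$ refers 
to the number of auxiliary variables. The following 
inequalities hold for the population variances of the domain 
regression estimators (2.1)-(2.6): 

(i) $V(\hat{Y}_{d,\mathrm{lr}_2}) \le V(\hat{Y}_{d,\mathrm{lr}_1})$ and $V(\hat{Y}_{d,\mathrm{lr}_2}) \le V(\hat{Y}_{d,\mathrm{lr}_3})$; 
$V(\hat{Y}_{d,\mathrm{lr}_3})$ may be smaller than, equal to, or greater than $V(\hat{Y}_{d,\mathrm{lr}_1})$. 

(ii) $V(\tilde{Y}_{d,\mathrm{lr}_2}) \le V(\tilde{Y}_{d,\mathrm{lr}_1})$ and $V(\tilde{Y}_{d,\mathrm{lr}_2}) \le V(\tilde{Y}_{d,\mathrm{lr}_3})$; 
$V(\tilde{Y}_{d,\mathrm{lr}_3})$ may be smaller than, equal to, or greater than $V(\tilde{Y}_{d,\mathrm{lr}_1})$. 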


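Proof. In the case of simple random sampling without 
replacement, $V(\hat{Y}_{d,\mathrm{lr}_j}) = A \sum_{U} (E_{dk} - \bar{E}_U)^2$ for $j = 1, 2, 3$, 
where $A = N^2 (1 - f) / (n(N - 1))$ and $\bar{E}_U = \sum_{U} E_{dk} / N$. 
Given that the regression contains an intercept, it follows 
that $\sum_{U} E_{dk} = 0$ or that $\sum_{U_d} E_{dk} = 0$, depending on which 
regression estimator we use. We only show that (i) holds; the 
proof for (ii) is similar. The population variances for 
$\hat{Y}_{d,\mathrm{lr}_1}$ and $\hat{Y}_{d,\mathrm{lr}_2}$ are respectively 

$$V(\hat{Y}_{d,\mathrm{lr}_1}) = A \sum_{U} (y_{dk} - \mathbf{x}_k' \mathbf{B}_{1d})^2$$

and 

$$V(\hat{Y}_{d,\mathrm{lr}_2}) = A \sum_{U} (y_{dk} - \mathbf{x}_{dk}' \mathbf{B}_{2d})^2.$$

The population variance of $\hat{Y}_{d,\mathrm{lr}_3}$ is 

$$V(\hat{Y}_{d,\mathrm{lr}_3}) = A \sum_{U} (E_{dk} - \bar{E}_U)^2,$$

where $E_{dk} = y_{dk} - \mathbf{x}_{dk}' \mathbf{B}_3$ and 

$$\bar{E}_U = \frac{N_d}{N} (\bar{y}_{U_d} - \bar{\mathbf{x}}_{U_d}' \mathbf{B}_3),$$

with $\bar{y}_{U_d} = N_d^{-1} \sum_{U_d} y_{dk}$ and $\bar{\mathbf{x}}_{U_d}$ similarly defined. 

We first show that $V(\hat{Y}_{d,\mathrm{lr}_2}) \le V(\hat{Y}_{d,\mathrm{lr}_1})$. To this end, we 
decompose $\sum_{U} (y_{dk} - \mathbf{x}_k' \mathbf{B}_{1d})^2$ into its within domain $U_d$ 
and outside domain $U_{\bar{d}}$ components, yielding 

$$\sum_{U} (y_{dk} - \mathbf{x}_k' \mathbf{B}_{1d})^2 = \sum_{U_d} (y_k - \mathbf{x}_k' \mathbf{B}_{1d})^2 + \sum_{U_{\bar{d}}} (\mathbf{x}_k' \mathbf{B}_{1d})^2.$$

Since 

$$\sum_{U_d} (y_k - \mathbf{x}_k' \mathbf{B}_{1d})^2 = \sum_{U_d} (y_k - \mathbf{x}_k' \mathbf{B}_{2d})^2 + \sum_{U_d} (\mathbf{x}_k' (\mathbf{B}_{2d} - \mathbf{B}_{1d}))^2,$$

it follows that $V(\hat{Y}_{d,\mathrm{lr}_2}) \le V(\hat{Y}_{d,\mathrm{lr}_1})$. 

Next, we show that $V(\hat{Y}_{d,\mathrm{lr}_2}) \le V(\hat{Y}_{d,\mathrm{lr}_3})$. 
The variance $V(\hat{Y}_{d,\mathrm{lr}_3})$ can be re-expressed as 

$$V(\hat{Y}_{d,\mathrm{lr}_3}) = A \Big\{ \sum_{U_d} (y_k - \mathbf{x}_k' \mathbf{B}_3)^2 - \frac{N_d^2}{N} (\bar{y}_{U_d} - \bar{\mathbf{x}}_{U_d}' \mathbf{B}_3)^2 \Big\}.$$

The difference between $V(\hat{Y}_{d,\mathrm{lr}_3})$ and $V(\hat{Y}_{d,\mathrm{lr}_2})$ is: 

$$V(\hat{Y}_{d,\mathrm{lr}_3}) - V(\hat{Y}_{d,\mathrm{lr}_2}) = A \Big\{ \sum_{U_d} (y_k - \mathbf{x}_k' \mathbf{B}_3)^2 - \sum_{U_d} (y_k - \mathbf{x}_k' \mathbf{B}_{2d})^2 - \frac{N_d^2}{N} (\bar{y}_{U_d} - \bar{\mathbf{x}}_{U_d}' \mathbf{B}_3)^2 \Big\}$$

$$= A \Big\{ (\mathbf{B}_3 - \mathbf{B}_{2d})' \sum_{U_d} \mathbf{x}_k \mathbf{x}_k' (\mathbf{B}_3 - \mathbf{B}_{2d}) - \frac{N_d^2}{N} \big( \bar{y}_{U_d}^2 - 2 \mathbf{B}_3' \bar{\mathbf{x}}_{U_d} \bar{y}_{U_d} + \mathbf{B}_3' \bar{\mathbf{x}}_{U_d} \bar{\mathbf{x}}_{U_d}' \mathbf{B}_3 \big) \Big\}$$

$$\ge A \Big\{ (\mathbf{B}_3 - \mathbf{B}_{2d})' \sum_{U_d} \mathbf{x}_k \mathbf{x}_k' (\mathbf{B}_3 - \mathbf{B}_{2d}) - N_d \big( \bar{y}_{U_d}^2 - 2 \mathbf{B}_3' \bar{\mathbf{x}}_{U_d} \bar{y}_{U_d} + \mathbf{B}_3' \bar{\mathbf{x}}_{U_d} \bar{\mathbf{x}}_{U_d}' \mathbf{B}_3 \big) \Big\}.$$

Noting that $\bar{y}_{U_d} = \bar{\mathbf{x}}_{U_d}' \mathbf{B}_{2d}$, it follows that: 

$$\bar{y}_{U_d}^2 - 2 \mathbf{B}_3' \bar{\mathbf{x}}_{U_d} \bar{y}_{U_d} + \mathbf{B}_3' \bar{\mathbf{x}}_{U_d} \bar{\mathbf{x}}_{U_d}' \mathbf{B}_3 = \mathbf{B}_{2d}' \bar{\mathbf{x}}_{U_d} \bar{\mathbf{x}}_{U_d}' \mathbf{B}_{2d} - 2 \mathbf{B}_3' \bar{\mathbf{x}}_{U_d} \bar{\mathbf{x}}_{U_d}' \mathbf{B}_{2d} + \mathbf{B}_3' \bar{\mathbf{x}}_{U_d} \bar{\mathbf{x}}_{U_d}' \mathbf{B}_3 = (\mathbf{B}_3 - \mathbf{B}_{2d})' \bar{\mathbf{x}}_{U_d} \bar{\mathbf{x}}_{U_d}' (\mathbf{B}_3 - \mathbf{B}_{2d}).$$

Since 

$$\sum_{U_d} \mathbf{x}_k \mathbf{x}_k' - N_d \bar{\mathbf{x}}_{U_d} \bar{\mathbf{x}}_{U_d}' = \sum_{U_d} (\mathbf{x}_k - \bar{\mathbf{x}}_{U_d})(\mathbf{x}_k - \bar{\mathbf{x}}_{U_d})',$$

the difference $V(\hat{Y}_{d,\mathrm{lr}_3}) - V(\hat{Y}_{d,\mathrm{lr}_2})$ can be expressed as: 

$$V(\hat{Y}_{d,\mathrm{lr}_3}) - V(\hat{Y}_{d,\mathrm{lr}_2}) \ge A (\mathbf{B}_3 - \mathbf{B}_{2d})' \Big\{ \sum_{U_d} \mathbf{x}_k \mathbf{x}_k' - N_d \bar{\mathbf{x}}_{U_d} \bar{\mathbf{x}}_{U_d}' \Big\} (\mathbf{B}_3 - \mathbf{B}_{2d})$$

$$= A (\mathbf{B}_3 - \mathbf{B}_{2d})' \sum_{U_d} (\mathbf{x}_k - \bar{\mathbf{x}}_{U_d})(\mathbf{x}_k - \bar{\mathbf{x}}_{U_d})' (\mathbf{B}_3 - \mathbf{B}_{2d}) \ge 0.$$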

Finally, we show that $V(\hat{Y}_{d,\mathrm{lr}_3})$ may be smaller than, equal to, or 
greater than $V(\hat{Y}_{d,\mathrm{lr}_1})$ by constructing examples: 

Gee V 7) = ee ene Bae 

Ger) = Vi eit Ba 

(iii) $V(\hat{Y}_{d,\mathrm{lr}_3}) > V(\hat{Y}_{d,\mathrm{lr}_1})$ if the fit of $y_k$ on $\mathbf{x}_k$ is 
much poorer than the fit of $y_{dk}$ on $\mathbf{x}_k$ for $k \in U$. 

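It can also be shown that $V(\tilde{Y}_{d,\mathrm{lr}_1}) \le V(\hat{Y}_{d,\mathrm{lr}_1})$, 
$V(\tilde{Y}_{d,\mathrm{lr}_2}) = V(\hat{Y}_{d,\mathrm{lr}_2})$, and $V(\tilde{Y}_{d,\mathrm{lr}_3}) \le V(\hat{Y}_{d,\mathrm{lr}_3})$. The 
estimator with the smallest variance is $\hat{Y}_{d,\mathrm{lr}_2}$. However, if it is 
assumed that the $\mathbf{B}_{2d}$'s are similar across all domains, and 
that there are very few observations in $s_d$, it may be 
preferable to use $\hat{Y}_{d,\mathrm{lr}_3}$. The choice between $\hat{Y}_{d,\mathrm{lr}_2}$ and 
$\hat{Y}_{d,\mathrm{lr}_3}$ should not always be based on the asymptotic 
variance. If there are very few observations in $s_d$, this can 
cause significant bias in $\hat{Y}_{d,\mathrm{lr}_2}$ and also cause the exact 
variance of $\hat{Y}_{d,\mathrm{lr}_2}$ to be larger than that of $\hat{Y}_{d,\mathrm{lr}_3}$, so that the 
latter may be preferred. 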


Remark 3.3: If there is no intercept in the regression, then it 
does not necessarily follow that Result 3.2 holds. 


Proof. We illustrate this statement using the elementary ratio 
versions of cases 1 and 2. They are respectively the 
Horvitz-Thompson ratio estimator 
$\hat{Y}_{d,\mathrm{RAT}} = \hat{Y}_{d,\mathrm{HT}} (X / \hat{X}_{\mathrm{HT}})$ and the 
Horvitz-Thompson post-stratified ratio estimator 
$\hat{Y}_{d,\mathrm{POSTR}} = \hat{Y}_{d,\mathrm{HT}} (X_d / \hat{X}_{d,\mathrm{HT}})$. Also, suppose that the 
elements of the data vector $(y_k, x_k)$ are positive for all 
$k \in U$. The population variances for $\hat{Y}_{d,\mathrm{POSTR}}$ and $\hat{Y}_{d,\mathrm{RAT}}$ are 
$V(\hat{Y}_{d,\mathrm{POSTR}}) = A \sum_{U_d} (y_k - B_{2d} x_k)^2$ and 
$V(\hat{Y}_{d,\mathrm{RAT}}) = A \sum_{U} (y_{dk} - B_{1d} x_k)^2$, where $B_{2d} = Y_d / X_d$ and 
$B_{1d} = Y_d / X$. 
The difference $V(\hat{Y}_{d,\mathrm{RAT}}) - V(\hat{Y}_{d,\mathrm{POSTR}})$ can be 
re-expressed as: 

$$A \sum_{U_d} (B_{1d} - B_{2d})^2 x_k^2 + 2A (B_{2d} - B_{1d}) \sum_{U_d} (y_k - B_{2d} x_k) x_k + A \sum_{U_{\bar{d}}} B_{1d}^2 x_k^2.$$

Since the second term of this expression can be positive, 
negative or zero, the difference $V(\hat{Y}_{d,\mathrm{RAT}}) - V(\hat{Y}_{d,\mathrm{POSTR}})$ 
can be negative. 


3.2 Conditional Properties 


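For a given sample $s$, let $n_d$ be the realized sample size 
of $s_d$. The following result can be used to evaluate the 
conditional bias of estimators (2.1) to (2.6). 


Result 3.3: Let $\mathbf{z}_k$ be an arbitrary p-dimensional vector, 
that is $\mathbf{z}_k = (z_{1k}, \ldots, z_{pk})'$, and suppose that $n_d \ge 1$. The 
conditional expectation of $\bar{\mathbf{z}}_s = n^{-1} \sum_{s} \mathbf{z}_k$ given $n_d$ can be 
written as: 

$$E(\bar{\mathbf{z}}_s \mid n_d) = w_d \bar{\mathbf{Z}}_{U_d} + (1 - w_d) \bar{\mathbf{Z}}_{U_{\bar{d}}} = \bar{\mathbf{Z}}_U + \frac{w_d - W_d}{1 - W_d} (\bar{\mathbf{Z}}_{U_d} - \bar{\mathbf{Z}}_U) \tag{3.1}$$

where $\bar{\mathbf{Z}}_U = N^{-1} \sum_{U} \mathbf{z}_k$, $\bar{\mathbf{Z}}_{U_d} = N_d^{-1} \sum_{U_d} \mathbf{z}_k$, 
$\bar{\mathbf{Z}}_{U_{\bar{d}}} = N_{\bar{d}}^{-1} \sum_{U_{\bar{d}}} \mathbf{z}_k$, $w_d = n_d / n$, 
$W_d = N_d / N$, $f_d = n_d / N_d$, $f_{\bar{d}} = n_{\bar{d}} / N_{\bar{d}}$, with 
$n_{\bar{d}} = n - n_d$, and $N_{\bar{d}} = N - N_d$. 

Proof. Rewriting $\bar{\mathbf{z}}_s$ as 

$$\bar{\mathbf{z}}_s = \frac{1}{n} \Big( \sum_{s_d} \mathbf{z}_k + \sum_{s_{\bar{d}}} \mathbf{z}_k \Big)$$

we have that 

$$E(\bar{\mathbf{z}}_s \mid n_d) = \frac{1}{n} \Big( \frac{n_d}{N_d} \sum_{U_d} \mathbf{z}_k + \frac{n - n_d}{N - N_d} \sum_{U_{\bar{d}}} \mathbf{z}_k \Big)$$

where $s_{\bar{d}} = \{k \in s \text{ and } k \notin s_d\}$ and 
$U_{\bar{d}} = \{k \in U \text{ and } k \notin U_d\}$. 
Since $\sum_{U_{\bar{d}}} \mathbf{z}_k = \sum_{U} \mathbf{z}_k - \sum_{U_d} \mathbf{z}_k$, we obtain the required 
result, that is 

$$E(\bar{\mathbf{z}}_s \mid n_d) = \bar{\mathbf{Z}}_U + \frac{w_d - W_d}{1 - W_d} (\bar{\mathbf{Z}}_{U_d} - \bar{\mathbf{Z}}_U).$$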
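
Result 3.3 can be verified by brute force on a tiny population. The Python sketch below enumerates every SRSWOR sample of size $n$ that contains exactly $n_d$ domain units (these samples are equally likely conditional on $n_d$), averages the sample mean over them, and compares the result with formula (3.1). The scalar values of $z_k$ are hypothetical.

```python
from itertools import combinations

# Hypothetical scalar z-values: 3 units in the domain, 4 outside it.
z_dom = [1.0, 4.0, 7.0]
z_rest = [2.0, 3.0, 5.0, 8.0]
n, n_d = 4, 1                      # sample size and realized domain sample size
N_d = len(z_dom)
N = len(z_dom) + len(z_rest)

# All samples with exactly n_d domain units and n - n_d non-domain units.
sample_means = [
    (sum(part_d) + sum(part_rest)) / n
    for part_d in combinations(z_dom, n_d)
    for part_rest in combinations(z_rest, n - n_d)
]
enumerated = sum(sample_means) / len(sample_means)

# Formula (3.1)
Z_U = (sum(z_dom) + sum(z_rest)) / N
Z_Ud = sum(z_dom) / N_d
w_d, W_d = n_d / n, N_d / N
formula = Z_U + (w_d - W_d) / (1 - W_d) * (Z_Ud - Z_U)
```

Both `enumerated` and `formula` equal 4.375 for these values, up to floating-point rounding.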


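Result 3.4: The conditional population variance of $\bar{\mathbf{z}}_s$ given 
$n_d$ can be written as 

$$V(\bar{\mathbf{z}}_s \mid n_d) = \frac{w_d^2}{n_d} (1 - f_d) \mathbf{V}_{dz} + \frac{w_{\bar{d}}^2}{n_{\bar{d}}} (1 - f_{\bar{d}}) \mathbf{V}_{\bar{d}z}$$

where 

$$\mathbf{V}_{dz} = \frac{1}{N_d - 1} \sum_{U_d} (\mathbf{z}_k - \bar{\mathbf{Z}}_{U_d})(\mathbf{z}_k - \bar{\mathbf{Z}}_{U_d})', \qquad \mathbf{V}_{\bar{d}z} = \frac{1}{N_{\bar{d}} - 1} \sum_{U_{\bar{d}}} (\mathbf{z}_k - \bar{\mathbf{Z}}_{U_{\bar{d}}})(\mathbf{z}_k - \bar{\mathbf{Z}}_{U_{\bar{d}}})',$$

with $\bar{\mathbf{Z}}_{U_{\bar{d}}} = N_{\bar{d}}^{-1} \sum_{U_{\bar{d}}} \mathbf{z}_k$ and $w_{\bar{d}} = 1 - w_d$. 


The estimator of the conditional population variance 
$V(\bar{\mathbf{z}}_s \mid n_d)$ is given by 

$$v(\bar{\mathbf{z}}_s \mid n_d) = \frac{w_d^2}{n_d} (1 - f_d) \mathbf{v}_{dz} + \frac{w_{\bar{d}}^2}{n_{\bar{d}}} (1 - f_{\bar{d}}) \mathbf{v}_{\bar{d}z}$$

where 

$$\mathbf{v}_{dz} = \frac{1}{n_d - 1} \sum_{s_d} (\mathbf{z}_k - \bar{\mathbf{z}}_{s_d})(\mathbf{z}_k - \bar{\mathbf{z}}_{s_d})'$$

and 

$$\mathbf{v}_{\bar{d}z} = \frac{1}{n_{\bar{d}} - 1} \sum_{s_{\bar{d}}} (\mathbf{z}_k - \bar{\mathbf{z}}_{s_{\bar{d}}})(\mathbf{z}_k - \bar{\mathbf{z}}_{s_{\bar{d}}})',$$

with $\bar{\mathbf{z}}_{s_d} = n_d^{-1} \sum_{s_d} \mathbf{z}_k$ and $\bar{\mathbf{z}}_{s_{\bar{d}}} = n_{\bar{d}}^{-1} \sum_{s_{\bar{d}}} \mathbf{z}_k$. 

Proof: It follows using arguments similar to those used in 
Result 3.3. 

We first illustrate how Result 3.3 can be used to 
obtain the conditional bias for the simpler estimators of 
domain totals. This includes the Horvitz-Thompson 
estimator $\hat{Y}_{d,\mathrm{HT}}$, as well as the post-stratified ratio estimator 
$\hat{Y}_{d,\mathrm{POSTR}} = (X_d / \hat{X}_{d,\mathrm{HT}}) \hat{Y}_{d,\mathrm{HT}}$. Let $z_k$ be the domain 
variable $y_{dk}$. Using Result 3.3, we have that 
$E(\hat{Y}_{d,\mathrm{HT}} \mid n_d) = N w_d \bar{y}_{U_d}$, where $\bar{y}_{U_d} = Y_d / N_d$. The 
conditional bias of $\hat{Y}_{d,\mathrm{HT}}$ given $n_d$ is thus 
$\mathrm{Bias}(\hat{Y}_{d,\mathrm{HT}} \mid n_d) = N (w_d - W_d) \bar{y}_{U_d}$. For the 
post-stratified ratio estimator, note that 
$\hat{Y}_{d,\mathrm{POSTR}} - Y_d = (X_d / \hat{X}_{d,\mathrm{HT}}) (\hat{Y}_{d,\mathrm{HT}} - (Y_d / X_d) \hat{X}_{d,\mathrm{HT}})$. 
Defining $z_k = y_{dk} - (Y_d / X_d) x_{dk}$, we obtain that 
$\mathrm{Bias}(\hat{Y}_{d,\mathrm{POSTR}} \mid n_d) \approx 0$. 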

We next proceed to evaluate the conditional bias and 
variance of estimators (2.1)-(2.6). We only illustrate the 
procedure for the regression estimator $\hat{Y}_{d,\mathrm{lr}_1}$, as the steps are 
similar for the other estimators. Conditional on $n_d$, the 
distribution of $s_d$ is that of an SRSWOR. This means that, for 
each sample $s_d$, $n_d$ can be considered as having been 
selected from $N_d$. We express $\hat{Y}_{d,\mathrm{lr}_1}$ as 
$\hat{Y}_{d,\mathrm{lr}_1} = \sum_{U} \hat{y}_k + (N/n) \sum_{s} e_{dk}$, where 
$e_{dk} = y_{dk} - \mathbf{x}_k' \hat{\mathbf{B}}_{1d}$ and $\hat{y}_k = \mathbf{x}_k' \hat{\mathbf{B}}_{1d}$. 
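Following Särndal and Hidiroglou (1989), we define the 
conditional regression vector $\mathbf{B}_{1d}^{c}$ as 

$$\mathbf{B}_{1d}^{c} = \Big[ E\Big( \sum_{s} \frac{\mathbf{x}_k \mathbf{x}_k'}{c_k} \,\Big|\, n_d \Big) \Big]^{-1} E\Big( \sum_{s} \frac{\mathbf{x}_k y_{dk}}{c_k} \,\Big|\, n_d \Big). \tag{3.2}$$

The estimated regression vector $\hat{\mathbf{B}}_{1d}$ will converge to 
$\mathbf{B}_{1d}^{c}$ (under appropriate conditions) in conditional design 
probability as $n_d$ and $N_d$ increase. 

We have that 

$$E\Big( \frac{1}{n} \sum_{s} \frac{\mathbf{x}_k \mathbf{x}_k'}{c_k} \,\Big|\, n_d \Big) = \frac{1}{N} \sum_{U} \frac{\mathbf{x}_k \mathbf{x}_k'}{c_k} + \mathbf{R}$$

and 

$$E\Big( \frac{1}{n} \sum_{s} \frac{\mathbf{x}_k y_{dk}}{c_k} \,\Big|\, n_d \Big) = \frac{1}{N} \sum_{U} \frac{\mathbf{x}_k y_{dk}}{c_k} + \mathbf{r}$$

where 

$$\mathbf{R} = \frac{w_d - W_d}{1 - W_d} \Big( \frac{1}{N_d} \sum_{U_d} \frac{\mathbf{x}_k \mathbf{x}_k'}{c_k} - \frac{1}{N} \sum_{U} \frac{\mathbf{x}_k \mathbf{x}_k'}{c_k} \Big) \approx \mathbf{0}$$

and 

$$\mathbf{r} = \frac{w_d - W_d}{1 - W_d} \Big( \frac{1}{N_d} \sum_{U_d} \frac{\mathbf{x}_k y_k}{c_k} - \frac{1}{N} \sum_{U} \frac{\mathbf{x}_k y_{dk}}{c_k} \Big) \approx \mathbf{0}.$$

Consequently, using Result 3.3 and assuming that 
$(w_d - W_d) / (1 - W_d) \approx 0$, we have that $\mathbf{B}_{1d}^{c} \approx \mathbf{B}_{1d}$. 
Define the "conditional residual" for the $k$-th unit as 

$$E_{dk}^{c} = y_{dk} - \mathbf{x}_k' \mathbf{B}_{1d}^{c}. \tag{3.3}$$

The deviation of $\hat{Y}_{d,\mathrm{lr}_1}$ from the true value $Y_d$ can be 
written as 

$$\hat{Y}_{d,\mathrm{lr}_1} - Y_d = -\sum_{U} E_{dk}^{c} + \frac{N}{n} \sum_{s} E_{dk}^{c} + A_{1d} \tag{3.4}$$

where 

$$A_{1d} = -\Big( \frac{N}{n} \sum_{s} \mathbf{x}_k - \sum_{U} \mathbf{x}_k \Big)' (\hat{\mathbf{B}}_{1d} - \mathbf{B}_{1d}^{c}).$$

In equation (3.4), $A_{1d}$ is of lower order than $(N/n) \sum_{s} E_{dk}^{c}$. 
To see this, note that 

$$E\Big( \frac{N}{n} \sum_{s} \mathbf{x}_k - \sum_{U} \mathbf{x}_k \,\Big|\, n_d \Big) = N \frac{w_d - W_d}{1 - W_d} (\bar{\mathbf{x}}_{U_d} - \bar{\mathbf{x}}_U),$$

where $(w_d - W_d) / (1 - W_d)$ should be close to zero. 

Also, as noted earlier, $\hat{\mathbf{B}}_{1d} - \mathbf{B}_{1d}^{c}$ is near the vector $\mathbf{0}$ in 
conditional design probability. Hence 
$E_{dk}^{c} = y_{dk} - \mathbf{x}_k' \mathbf{B}_{1d}^{c} \approx y_{dk} - \mathbf{x}_k' \mathbf{B}_{1d} = E_{dk}$. This implies that we can 
write (3.4) as 

$$\hat{Y}_{d,\mathrm{lr}_1} - Y_d \approx -\sum_{U} E_{dk} + \frac{N}{n} \sum_{s} E_{dk}. \tag{3.5}$$

The conditional expectation of $\hat{Y}_{d,\mathrm{lr}_1} - Y_d$ is 
approximately: 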
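
$$E\big[ (\hat{Y}_{d,\mathrm{lr}_1} - Y_d) \mid n_d \big] \approx N \frac{w_d - W_d}{1 - W_d} (\bar{E}_{U_d} - \bar{E}_U) \tag{3.6}$$

where $\bar{E}_{U_d} = \sum_{U_d} E_{dk} / N_d$ and $\bar{E}_U = \sum_{U} E_{dk} / N$. 
Since $\bar{y}_{U} = W_d \bar{y}_{U_d}$ for the domain variable $y_{dk}$, the 
conditional expectation (3.6) can be re-expressed as: 

$$E\big[ (\hat{Y}_{d,\mathrm{lr}_1} - Y_d) \mid n_d \big] \approx N \frac{w_d - W_d}{1 - W_d} \big[ \bar{y}_{U_d} (1 - W_d) - (\bar{\mathbf{x}}_{U_d} - \bar{\mathbf{x}}_U)' \mathbf{B}_{1d} \big]. \tag{3.7}$$

The term $\sum_{U} E_{dk}$ is constant in (3.5). Using Result 3.4, 
the conditional population variance of $\hat{Y}_{d,\mathrm{lr}_1}$ and its 
estimated value are respectively 

$$V(\hat{Y}_{d,\mathrm{lr}_1} \mid n_d) = N^2 \Big[ \frac{w_d^2}{n_d} (1 - f_d) V_{dE} + \frac{w_{\bar{d}}^2}{n_{\bar{d}}} (1 - f_{\bar{d}}) V_{\bar{d}E} \Big]$$

with $V_{dE} = (N_d - 1)^{-1} \sum_{U_d} (E_{dk} - \bar{E}_{U_d})^2$ and $V_{\bar{d}E}$ defined 
similarly over $U_{\bar{d}}$, 

and 

$$v(\hat{Y}_{d,\mathrm{lr}_1} \mid n_d) = N^2 \Big[ \frac{w_d^2}{n_d} (1 - f_d) v_{de} + \frac{w_{\bar{d}}^2}{n_{\bar{d}}} (1 - f_{\bar{d}}) v_{\bar{d}e} \Big]$$

with $v_{de} = (n_d - 1)^{-1} \sum_{s_d} (e_{dk} - \bar{e}_{s_d})^2$, $\bar{e}_{s_d} = \sum_{s_d} e_{dk} / n_d$, and 
$v_{\bar{d}e}$ defined similarly over $s_{\bar{d}}$. 

The conditional bias and variances of the remaining five 
estimators can be derived similarly. Table 3 presents a 
summary of these properties. The required adjustment factors 
$a_{dk}$ and residual terms $e_{dk}$ are given in Tables 1 and 2. 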


4. SIMULATION STUDY 


A simulation study was carried out to illustrate the 
conditional and unconditional properties of the ratio version 
of estimators (2.1)-(2.6). We studied these properties using 
a population of 1,000 bivariate observations $(y, x)$. This 
population resulted from the concatenation of two generated 
population domains: a large domain of size 900 and a small 
domain of size 100. The $(y, x)$ observations were generated 
within each domain assuming a ratio model $y_k = \beta x_k + \varepsilon_k$, 
where $E(\varepsilon_k) = 0$ and $V(\varepsilon_k) = \sigma^2 x_k$. The $\beta$ coefficients 
were 1.0 and 3.0 in the large and small domains. The 
auxiliary variable $x$ was generated using a gamma 
distribution $\Gamma(a, b)$, where $a = 3$ and $b = 16$. The 
dependent variable $y$ was also generated by a gamma 
distribution, $\Gamma(A, B)$, such that the parameters $A$ and $B$ 
satisfied $E(y_k) = \beta x_k = AB$ and $V(y_k) = \sigma^2 x_k = AB^2$. 
After solving for $A$ and $B$, we obtained $A = \beta^2 x_k / \sigma^2$ and 
$B = \sigma^2 / \beta$. The term $\sigma^2$ was chosen to satisfy a set 
correlation between $x$ and $y$ defined by 

$$\rho_{x,y} = \frac{\beta b}{\sqrt{\sigma^2 b + \beta^2 b^2}}.$$

The preceding equation yields the constant term 

$$\sigma^2 = \beta^2 b \Big( \frac{1}{\rho_{x,y}^2} - 1 \Big)$$

of the error variance. Common correlation values $\rho_{x,y}$ were 
used for both domains, ranging from 0.1 to 0.9 in steps of 
0.1, resulting in nine different populations. Random samples 
(M = 10,000) of size 250 were then repeatedly selected from 
the populations. For each sample, estimates of domain totals 
were computed using the estimators given in Table 4. We 
do not include the Hajek post-stratified estimator, $\tilde{Y}_{d,\mathrm{lr}_2}$, as 
it corresponds exactly to its Horvitz-Thompson analogue, 
$\hat{Y}_{d,\mathrm{lr}_2}$. 
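
The population generation described above can be sketched as follows. This is an illustration only: it assumes that $\Gamma(a, b)$ denotes a gamma distribution with shape $a$ and scale $b$ (which is what makes the correlation formula consistent), and the function names, the seed and the choice $\rho_{x,y} = 0.6$ are hypothetical.

```python
import math
import random

def sigma2_for(rho, beta, b):
    """Error-variance constant sigma^2 giving correlation rho between x and y."""
    return beta ** 2 * b * (1.0 / rho ** 2 - 1.0)

def implied_rho(sigma2, beta, b):
    """Correlation implied by sigma^2 under the ratio model."""
    return beta * b / math.sqrt(sigma2 * b + beta ** 2 * b ** 2)

def generate_domain(size, beta, rho, rng, a=3.0, b=16.0):
    """Generate (y, x) pairs for one domain.

    x ~ Gamma(shape a, scale b); given x, y ~ Gamma(shape A, scale B) with
    A = beta^2 x / sigma^2 and B = sigma^2 / beta, so that
    E(y | x) = beta x and V(y | x) = sigma^2 x.
    """
    sigma2 = sigma2_for(rho, beta, b)
    B = sigma2 / beta
    units = []
    for _ in range(size):
        x = rng.gammavariate(a, b)
        A = beta ** 2 * x / sigma2
        y = rng.gammavariate(A, B)
        units.append((y, x))
    return units

rng = random.Random(12345)
rho = 0.6
large_domain = generate_domain(900, 1.0, rho, rng)
small_domain = generate_domain(100, 3.0, rho, rng)
population = large_domain + small_domain
```

Repeating this for $\rho_{x,y} = 0.1, 0.2, \ldots, 0.9$ would give the nine populations mentioned in the text.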


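Table 3 
Conditional Bias and Variance of Estimators (2.1)-(2.6) 

Estimator | Conditional Bias | Estimated Conditional Variance 
$\hat{Y}_{d,\mathrm{lr}_1}$ | $N ((w_d - W_d)/(1 - W_d)) [ \bar{y}_{U_d} (1 - W_d) - (\bar{\mathbf{x}}_{U_d} - \bar{\mathbf{x}}_U)' \mathbf{B}_{1d} ]$ | $N^2 [ (w_d^2 / n_d)(1 - f_d) v_{de} + (w_{\bar{d}}^2 / n_{\bar{d}})(1 - f_{\bar{d}}) v_{\bar{d}e} ]$ 
$\hat{Y}_{d,\mathrm{lr}_2}$ | Almost 0 | $(N^2 w_d^2 (1 - f_d) / n_d) \sum_{s_d} (a_{dk} e_{dk} - \overline{ae}_d)^2 / (n_d - 1)$ 
$\hat{Y}_{d,\mathrm{lr}_3}$ | $N (w_d - W_d) (\bar{y}_{U_d} - \bar{\mathbf{x}}_{U_d}' \mathbf{B}_3)$ | $(N^2 w_d^2 (1 - f_d) / n_d) \sum_{s_d} (a_{dk} e_{dk} - \overline{ae}_d)^2 / (n_d - 1)$ 
$\tilde{Y}_{d,\mathrm{lr}_1}$ | $N (w_d - W_d) (\bar{\mathbf{x}}_U - \bar{\mathbf{x}}_{U_d})' \mathbf{B}_{1d} / (1 - W_d)$ | $N^2 [ (w_d^2 / n_d)(1 - f_d) v_{de} + (w_{\bar{d}}^2 / n_{\bar{d}})(1 - f_{\bar{d}}) v_{\bar{d}e} ]$ 
$\tilde{Y}_{d,\mathrm{lr}_2}$ | Almost 0 | $(N^2 w_d^2 (1 - f_d) / n_d) \sum_{s_d} (a_{dk} e_{dk} - \overline{ae}_d)^2 / (n_d - 1)$ 
$\tilde{Y}_{d,\mathrm{lr}_3}$ | Almost 0 | $(N^2 w_d^2 (1 - f_d) / n_d) \sum_{s_d} (a_{dk} e_{dk} - \overline{ae}_d)^2 / (n_d - 1)$ 
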
(waa fad/na)S, (lea eax —20€) Ang -1) 
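Table 4 
Estimators and Associated Error Terms 

Estimator | Ratio Version | Error Term 
HT ratio: $\hat{Y}_{d,\mathrm{lr}_1}$ | $\hat{Y}_{d,\mathrm{RAT}} = X \hat{R}_{1d}$ | $e_{dk} = y_{dk} - \hat{R}_{1d} x_k$, $\hat{R}_{1d} = \hat{Y}_{d,\mathrm{HT}} / \hat{X}_{\mathrm{HT}}$ 
HT post-stratified ratio: $\hat{Y}_{d,\mathrm{lr}_2}$ | $\hat{Y}_{d,\mathrm{POSTR}} = X_d \hat{R}_{2d}$ | $e_{dk} = y_{dk} - \hat{R}_{2d} x_{dk}$, $\hat{R}_{2d} = \hat{Y}_{d,\mathrm{HT}} / \hat{X}_{d,\mathrm{HT}}$ 
HT alternate ratio: $\hat{Y}_{d,\mathrm{lr}_3}$ | $\hat{Y}_{d,\mathrm{ALTR}} = \hat{Y}_{d,\mathrm{HT}} + (X_d - \hat{X}_{d,\mathrm{HT}}) \hat{R}_3$ | $e_{dk} = y_{dk} - \hat{R}_3 x_{dk}$, $\hat{R}_3 = \hat{Y}_{\mathrm{HT}} / \hat{X}_{\mathrm{HT}}$ 
Hajek ratio: $\tilde{Y}_{d,\mathrm{lr}_1}$ | $\tilde{Y}_{d,\mathrm{RAT}} = \hat{Y}_{d,\mathrm{HA}} + (X - \hat{X}_{\mathrm{HA}}) \hat{R}_{1d}$ | $e_{dk} = y_{dk} - \hat{R}_{1d} x_k$, $\hat{R}_{1d} = \hat{Y}_{d,\mathrm{HT}} / \hat{X}_{\mathrm{HT}}$ 
Hajek alternate ratio: $\tilde{Y}_{d,\mathrm{lr}_3}$ | $\tilde{Y}_{d,\mathrm{ALTR}} = \hat{Y}_{d,\mathrm{HA}} + (X_d - \hat{X}_{d,\mathrm{HA}}) \hat{R}_3$ | $e_{dk} = y_{dk} - \hat{R}_3 x_{dk}$, $\hat{R}_3 = \hat{Y}_{\mathrm{HT}} / \hat{X}_{\mathrm{HT}}$ 
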


4.1 Unconditional Results 


The unconditional properties of the estimators were 
assessed using two performance measures: (i) root mean 
squared error (RMSE) and (ii) coverage rate (CR). They 
are: 

i. The RMSE is defined as 

$$\mathrm{RMSE} = \sqrt{ \sum_{m=1}^{M} (\hat{Y}_d^{(m)} - Y_d)^2 / M }$$

where $\hat{Y}_d^{(m)}$ is the estimated total (either Horvitz-Thompson 
or Hajek type) based on sample $m$, and $M$ is 
the total number of samples drawn for the simulation. 

ii. The coverage rate CR for a given estimator $\hat{Y}_d$ is 
defined as the ratio of the number of times that the 95% 
confidence interval 

$$\hat{Y}_d^{(m)} \pm 1.96 \sqrt{ v(\hat{Y}_d^{(m)}) }$$

contains the true population total to the number of 
replicates. We used the unconditional variances given by 
(2.12), and the error terms in Table 4 to estimate the 
required variances. 
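
The two performance measures can be sketched in Python as follows. The function names `rmse` and `coverage_rate` and the numerical inputs are hypothetical; in the simulation, the inputs would be the M replicate estimates and their estimated variances from (2.12).

```python
import math

def rmse(estimates, true_total):
    """Root mean squared error of replicate estimates around the true total."""
    m = len(estimates)
    return math.sqrt(sum((est - true_total) ** 2 for est in estimates) / m)

def coverage_rate(estimates, variances, true_total, z=1.96):
    """Share of replicates whose interval est +/- z*sqrt(v) covers the true total."""
    hits = sum(
        1
        for est, var in zip(estimates, variances)
        if abs(est - true_total) <= z * math.sqrt(var)
    )
    return hits / len(estimates)

# Hypothetical replicate results for a true total of 10.
example_rmse = rmse([8.0, 12.0], 10.0)
example_cr = coverage_rate([10.0, 20.0], [1.0, 1.0], 10.0)
```

In this toy input, the first interval covers the true total and the second does not, so the coverage rate is one half.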


The four graphs provided in Figures 1 and 2 summarize 
the unconditional analysis for small and large domains. Also 
shown is the impact of increasing $\rho_{x,y}$. The square root of 
the average mean squared error and coverage rates are used 
to compare the estimators. 

In Figure 1, we note that the RMSE decreases 
substantially with increasing $\rho_{x,y}$. This can be attributed to 
the decreasing dispersion of the dependent variable 
conditional on the independent variable as the correlation 
between the two increases. We also note that the spread of 
the RMSE is narrower for the large domain than for the 
small domain. The ranking of the estimators in terms of 
RMSE from worst to best is as follows: (i) HT ratio (HT 
RAT), (ii) Hajek ratio (HA RAT), (iii) HT alternate ratio 
(HT ALTR), (iv) Hajek alternate ratio (HA ALTR), and (v) 
HT post-stratified ratio (HT POSTR). This ranking is in 
agreement with Result 3.2. 

In Figure 2, we note that the unconditional coverage rates 
are similar across all the estimators regardless of the 
correlation $\rho_{x,y}$. For small domains the Horvitz-Thompson 
estimators exhibit a slight degradation in the coverage rate 
when $\rho_{x,y}$ is weak. But as the correlation increases, their 
coverage rate becomes comparable to that of the Hajek type 
estimators. The Hajek estimators have a better overall 
coverage rate than their Horvitz-Thompson counterparts. 


4.2 Conditional Results 


The conditional properties of the estimators were studied 
using: (i) average relative conditional bias and (ii) 
conditional coverage rates. They are defined as: 

i. $\mathrm{ARB}_{n_d} = (100 / M_{n_d}) \sum_{m} (\hat{Y}_d^{(m)} - Y_d) / Y_d$, where the sum is 
over the samples with realized domain size $n_d$ and $M_{n_d}$ is 
the number of samples of size $n_d$. 


ii. The conditional coverage rate has the same definition as 
its unconditional counterpart. The associated variance is 


where 


7, -L Sym 
(aie M, d 3 

Table 5 summarizes the conditional biases of the ratio 
versions of estimators (2.1)—(2.4) and (2.6). They were 
obtained from Table 3 using a single auxiliary variable. 

The relative conditional bias and coverage rates of the 
estimators are summarized in Figures 3, 4a, and 4b with 
respect to the realized sample size $n_d$ for large and small 
domains, and for two correlations ($\rho_{x,y} = 0.90$ and 
$\rho_{x,y} = 0.60$). 
Unconditional RMSE - Small Domain 
Figure 1. Unconditional RMSE 
Figure 2. Unconditional Coverage Rates 
Figure 3. Average Relative Conditional Bias for py y = 0.90, Bg, =1.0, and By. =3.0. 
Figure 4a. Conditional Coverage Rates for py y = 0.90, Bz, =1.0, and Bj. =3.0. 
Figure 4b. Conditional Coverage Rates for py y = 0.60, By, =1.0, and By, =3.0. 


Conditional Biases of Ratio Versions of Estimators (2.1)-(2.4) and (2.6) 


Estimator 


HT Ratio: Y, », 


HT post-stratified ratio: y d br, 


HT alternate ratio: Y dlr, 


Hajek ratio: YG br, 


Table 5 
Conditional Bias 
Wale Sem 
Almost 0 
ioe (cee vie) (e, 7 wat /Yu, ) 
Almost 0 


Hajek alternate ratio: vs lr, 


The conditional bias presented in Figure 3 supports the 
theoretical results presented in Table 5. The three Hajek 
estimators are nearly conditionally unbiased. The magnitude 
of the conditional bias of both the HT ratio estimator and the 
HT alternate ratio estimator is in agreement with the 
theoretical conditional bias. But it should be noted that the 
conditional bias associated with the HT alternate ratio 
estimator is smaller than the one of the HT ratio estimator. 
Also, in larger domains, this conditional bias is less 
pronounced for the HT alternate ratio estimator. 


The conditional coverage rates are given in Figures 4a 
and 4b. We note that the three Hajek estimators follow 
closely the nominal 95% coverage probability. The cover- 
age rate of the HT alternate ratio estimator is reasonable in 
larger domains despite its being conditionally biased. But its 
coverage deteriorates substantially in smaller domains. The 
coverage rate of the HT ratio estimator is not acceptable. 
But it should be noted that the coverage rates of the condi- 
tionally biased estimators improve as the realized sample 
size n, approaches the expected domain sample size 


E(n,). 
In summary, the simulation study identified the three 
Hajek estimators (Hajek post-stratified ratio, Hajek 
alternate ratio, and Hajek ratio) as the best estimators in 
terms of their conditional and unconditional properties. Note 
that even though the Hajek ratio estimator uses the least 
domain auxiliary data (it uses domain population counts 
$N_d$), its mean squared error is still reasonable. The Hajek 
post-stratified ratio is the best estimator in terms of its 
conditional and unconditional properties. 


5. CONCLUDING REMARKS 


We have studied six possible regression estimators of 
domain totals, each using various levels of auxiliary 
information at the domain and/or population level. The only 
estimator that has regression weights that are not domain 
dependent and that also has the additive property is the 
Horvitz-Thompson estimator Y am: this estimator is 
constructed using auxiliary information at the population 
level: the domain dependent independent variable y,, is 
regressed on the auxiliary vector x,. However, it can be 
seriously conditionally biased and the associated confidence 
intervals can be understated. 

The Hajek-type estimators have two disadvantages: 
(i) they do not have the additive property; and (ii) their 
associated regression weights are domain dependent. 
However, they have the best conditional properties. They 
are nearly conditionally unbiased, and the conditional 
confidence intervals associated with the estimators follow 
closely the nominal coverage rate. They also have the 
smaller unconditional MSE’s. The Hajek estimator that uses 
the least auxiliary data at the domain level is oe It 
requires domain population counts N, (d= 1,..., D), and 
the population totals X . Its conditional and unconditional 
properties are reasonable. 

The best Hajek estimator, Ven uses auxiliary infor- 
mation at the domain level. The Hajek regression type esti- 
mator Ye ;,, Can be made domain independent using a single 
set of regression weights as follows. Suppose that the most 
important domains are oS GU (¢g=1], «5G),.and that 
these domains are mutually exclusive and exhaustive. The 
resulting Hajek estimator is 


G 


ee = » ie rts (X, —X, ua ¥B,, | 


g=l 


and 


~ -1 
B,. =(y. WwW, X, x, [c, Soy WwW, X, yz /Cy- 
Prediction of Finite Population Totals Based on the Sample Distribution 


MICHAIL SVERCHKOV and DANNY PFEFFERMANN 


ABSTRACT 


This article studies the use of the sample distribution for the prediction of finite population totals under single-stage 
sampling. The proposed predictors employ the sample values of the target study variable, the sampling weights of the 
sample units and possibly known population values of auxiliary variables. The prediction problem is solved by estimating 
the expectation of the study values for units outside the sample as a function of the corresponding expectation under the 
sample distribution and the sampling weights. The prediction mean square error is estimated by a combination of an inverse 
sampling procedure and a re-sampling method. An interesting outcome of the present analysis is that several familiar 
estimators in common use are shown to be special cases of the proposed approach, thus providing them a new interpretation. 
The performance of the new and some old predictors in common use is evaluated and compared by a Monte Carlo 
simulation study using a real data set. 


KEY WORDS: Bootstrap; Design consistency; Informative sampling; Sample-complement distribution. 


1. INTRODUCTION 


The sample distribution is the parametric distribution of 
the outcome values for units included in the sample. This 
distribution is different from the population distribution if 
the sample selection probabilities are correlated with the 
values of the study variable even when conditioning on the 
values of concomitant variables included in the population 
model. It is also different from the randomization (design) 
distribution that accounts for all the possible sample 
selections with the population values held fixed. The sample 
distribution is defined and discussed with examples in 
Pfeffermann, Krieger and Rinott (1998), and is further 
investigated in Pfeffermann and Sverchkov (1999) who use 
it for the estimation of linear regression models. Krieger and 
Pfeffermann (1997) use the sample distribution for testing 
population distribution functions and Pfeffermann and 
Sverchkov (2003a) discuss its use for fitting Generalized 
Linear Models. Chambers, Dorfman and Sverchkov (2003) 
utilize the sample distribution for nonparametric estimation 
of regression models, and Kim (2002) and Pfeffermann and 
Sverchkov (2003b) apply it for small area estimation 
problems. 

In this article we study the use of the sample distribution 
for the prediction of finite population totals under single- 
stage sampling. It is assumed that the population outcome 
values (the y-values) are random realizations from some 
distribution that conditions on known values of auxiliary 
variables (the x-values). The problem considered is the 
prediction of the population total Y based on the sample 
y-values, the sampling weights for units in the sample and 
the population x-values. The use of the sample distribution 
permits conditioning on all these values, which is not 
possible under the randomization (design) distribution, and 
the prediction of Y is equivalent therefore to the prediction 
of the y-values for units outside the sample. 

The prediction problem is solved by estimating the 
conditional expectation of the y-values (given the x-values) 
for units outside the sample as a function of the conditional 
sample expectation (the expectation under the sample 
distribution) and the sampling weights. The prediction mean 
square error is estimated by a combination of an inverse 
sampling procedure and a re-sampling method. As it turns 
out, several familiar estimators in common use and in 
particular, classical design based estimators are special cases 
of the proposed procedure, thus providing them a new 
interpretation. The performance of the new and old 
predictors is evaluated and compared by means of a Monte 
Carlo simulation study using a real data set. 


2. THE SAMPLE AND SAMPLE-COMPLEMENT 
DISTRIBUTIONS 


2.1 The Sample Distribution 


Suppose that the population values $\{\mathbf{y}, \mathbf{X}\} = \{(y_1, \ldots, y_N)', [\mathbf{x}_1, \ldots, \mathbf{x}_N]'\}$ 
are random realizations with conditional probability density 
function (pdf) $f_p(y_i \mid \mathbf{x}_i)$ that may be discrete or continuous. 
The y-values are assumed to be scalars but the x-values can be 
vectors. We consider single stage sampling with sample inclusion 
probabilities $\pi_i = \Pr(i \in s) = g(\mathbf{y}, \mathbf{X}, \mathbf{Z}, i)$ for some function g, 
where $\mathbf{Z}$ defines the population values of design variables used for 
the sampling process. Note that the y-values are random and 
we also consider the design variables as random so that the 
inclusion probabilities are random as well. Let $I_i = 1$ if $i \in s$ 
and $I_i = 0$ if $i \notin s$. The conditional marginal sample pdf is defined as, 

$$
f_s(y_i \mid \mathbf{x}_i) \overset{\text{def}}{=} f_p(y_i \mid \mathbf{x}_i, I_i = 1)
= \frac{\Pr(I_i = 1 \mid y_i, \mathbf{x}_i)\, f_p(y_i \mid \mathbf{x}_i)}{\Pr(I_i = 1 \mid \mathbf{x}_i)}
\tag{2.1}
$$

with the second equality obtained by application of Bayes 
theorem. Note that $\Pr(I_i = 1 \mid y_i, \mathbf{x}_i)$ is not necessarily the 
same as the actual sample selection probability $\pi_i = g(\mathbf{y}, \mathbf{X}, \mathbf{Z}, i)$ 
(see Remark 1 below). It follows from (2.1) 
that the population and sample pdfs are different, unless 
$\Pr(I_i = 1 \mid y_i, \mathbf{x}_i) = \Pr(I_i = 1 \mid \mathbf{x}_i)$ for all $y_i$. When the 
sample pdf differs from the population pdf, the sampling 
design is informative, and the sampling scheme cannot be 
ignored in the inference process. 


Remark 1. It is important to emphasize that the definition 
and use of the sample distribution does not assume that the 
sample selection probabilities are functions of only $(y_i, \mathbf{x}_i)$. 
As mentioned earlier and highlighted by expressing the 
selection probabilities as $\pi_i = g(\mathbf{y}, \mathbf{X}, \mathbf{Z}, i)$, the actual 
selection probabilities may depend on all the population 
values $(\mathbf{y}, \mathbf{X}, \mathbf{Z})$. However, as shown in Pfeffermann and 
Sverchkov (1999), $E_p(\pi_i \mid y_i, \mathbf{x}_i) = \Pr(I_i = 1 \mid y_i, \mathbf{x}_i)$. 
Thus, although the selection probabilities may depend on all 
the population values $(\mathbf{y}, \mathbf{X}, \mathbf{Z})$, for given values $(y_i, \mathbf{x}_i)$ 
they equal $\Pr(I_i = 1 \mid y_i, \mathbf{x}_i)$ 'on average'. In fact, $\pi_i$ may 
not depend directly on $\mathbf{y}$ at all and only be a function of 
$(\mathbf{X}, \mathbf{Z})$, and still the expectation $E_p(\pi_i \mid y_i, \mathbf{x}_i)$ equals 
$\Pr(I_i = 1 \mid y_i, \mathbf{x}_i)$. The reason why the expectation may 
depend on $y_i$ in this case is that $\mathbf{Z}$ may be correlated with $\mathbf{y}$. 
For example, the 1999 Canadian Workplace and Employee 
Survey uses a disproportionate stratified sample with the 
strata defined by region, activity, and the size of the 
workplace. The size information is obtained from tax 
records from 1998; see, Patak, Hidiroglou and Lavallée 
(2000) for details. When modeling the payrolls in 1999 
against the number of employees, the sampling design is 
found to be informative, which is explained by the fact that 
the stratification is based in part on the size obtained from 
the tax records in the previous year, which are correlated 
with the payroll the year after. See Fuller (2003) for details 
of the analysis. 

The discussion above should not be understood to mean 
that $\pi_i$ is never a function of $(y_i, \mathbf{x}_i)$ only. A classical 
example for the latter case is retrospective sampling. Thus, 
in a case control study, the selection probabilities of the 
cases and controls usually only depend on the respective y 
and x values (and often just on the y values). In the 
empirical study of this paper we use a real data set where the 
sample was drawn by a disproportionate stratified sample 


with the strata boundaries defined by the values of the 
dependent variable. 

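In what follows we regard the probabilities $\pi_i$ as random 
realizations of the random variable $g(\mathbf{y}, \mathbf{X}, \mathbf{Z}, i)$. Let 
$w_i = 1/\pi_i$ define the sampling weight of unit $i$. The 
following relationships, established in Pfeffermann and 
Sverchkov (1999), hold for general pairs of vector random 
variables $(\mathbf{u}_i, \mathbf{v}_i)$, with $E_p$ and $E_s$ defining expectations 
under the population and sample pdfs respectively. (As a 
special case, $\mathbf{u}_i = y_i$, $\mathbf{v}_i = \mathbf{x}_i$.) 

$$
f_s(\mathbf{u}_i \mid \mathbf{v}_i) = \frac{E_p(\pi_i \mid \mathbf{u}_i, \mathbf{v}_i)\, f_p(\mathbf{u}_i \mid \mathbf{v}_i)}{E_p(\pi_i \mid \mathbf{v}_i)} \tag{2.2}
$$

$$
f_p(\mathbf{u}_i \mid \mathbf{v}_i) = \frac{E_s(w_i \mid \mathbf{u}_i, \mathbf{v}_i)\, f_s(\mathbf{u}_i \mid \mathbf{v}_i)}{E_s(w_i \mid \mathbf{v}_i)} \tag{2.3}
$$

$$
E_p(\mathbf{u}_i \mid \mathbf{v}_i) = \frac{E_s(w_i \mathbf{u}_i \mid \mathbf{v}_i)}{E_s(w_i \mid \mathbf{v}_i)} \tag{2.4}
$$

It follows from (2.4) that 

$$
\text{(a)}\ E_s(w_i \mid \mathbf{v}_i) = \frac{1}{E_p(\pi_i \mid \mathbf{v}_i)}; \qquad
\text{(b)}\ E_p(\mathbf{u}_i) = \frac{E_s(w_i \mathbf{u}_i)}{E_s(w_i)}; \qquad
\text{(c)}\ E_s(w_i) = \frac{1}{E_p(\pi_i)}. \tag{2.5}
$$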

For a detailed discussion of the sample distribution with 
illustrations, see Pfeffermann et al. (1998). 
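
Relationship (2.5b) can be checked numerically. The sketch below is an illustration only: the uniform population, the Poisson sampling design with $\pi_i$ proportional to $y_i$, and all function names are assumptions introduced here, not part of the paper. Under this informative design the unweighted sample mean overstates the population mean, while the ratio of sample means of $w_i y_i$ and $w_i$ recovers it approximately.

```python
import random


def weighted_population_mean(y_sample, w_sample):
    """Sample analogue of (2.5b): sum(w*y) / sum(w) estimates E_p(y)."""
    return sum(w * y for w, y in zip(w_sample, y_sample)) / sum(w_sample)


def poisson_sample(y_pop, pi_pop, rng):
    """Include unit i independently with probability pi_pop[i] (assumed design)."""
    idx = [i for i, p in enumerate(pi_pop) if rng.random() < p]
    return [y_pop[i] for i in idx], [1.0 / pi_pop[i] for i in idx]


rng = random.Random(2004)
N = 20000
y_pop = [rng.uniform(1.0, 9.0) for _ in range(N)]
pi_pop = [0.02 * y for y in y_pop]  # informative: larger y, larger pi

y_s, w_s = poisson_sample(y_pop, pi_pop, rng)

pop_mean = sum(y_pop) / N                      # target E_p(y)
naive = sum(y_s) / len(y_s)                    # estimates E_s(y), not E_p(y)
weighted = weighted_population_mean(y_s, w_s)  # estimates E_p(y) via (2.5b)

print(round(pop_mean, 2), round(naive, 2), round(weighted, 2))
```

The gap between `naive` and `pop_mean` is the effect of informativeness on the marginal distribution of y; the weighting in `weighted_population_mean` removes it.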


2.2 The Sample-Complement Distribution 


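Similar to (2.1), we define the conditional pdf for units 
outside the sample as, 

$$
f_c(y_i \mid \mathbf{x}_i) \overset{\text{def}}{=} f_p(y_i \mid \mathbf{x}_i, I_i = 0)
= \frac{\Pr(I_i = 0 \mid y_i, \mathbf{x}_i)\, f_p(y_i \mid \mathbf{x}_i)}{\Pr(I_i = 0 \mid \mathbf{x}_i)}. \tag{2.6}
$$

The relationships (2.2)-(2.5) and the equality 
$\Pr(I_i = 0 \mid \mathbf{u}_i, \mathbf{v}_i) = 1 - \Pr(I_i = 1 \mid \mathbf{u}_i, \mathbf{v}_i) = 1 - E_p(\pi_i \mid \mathbf{u}_i, \mathbf{v}_i)$ 
imply the following representations of 
the sample-complement distribution for general pairs of 
vector random variables $(\mathbf{u}_i, \mathbf{v}_i)$. 

$$
f_c(\mathbf{u}_i \mid \mathbf{v}_i)
= \frac{E_p[(1 - \pi_i) \mid \mathbf{u}_i, \mathbf{v}_i]\, f_p(\mathbf{u}_i \mid \mathbf{v}_i)}{E_p[(1 - \pi_i) \mid \mathbf{v}_i]}
= \frac{E_p[(1 - \pi_i) \mid \mathbf{u}_i, \mathbf{v}_i]\, E_p(\pi_i \mid \mathbf{v}_i)\, f_s(\mathbf{u}_i \mid \mathbf{v}_i)}{E_p[(1 - \pi_i) \mid \mathbf{v}_i]\, E_p(\pi_i \mid \mathbf{u}_i, \mathbf{v}_i)} \tag{2.7}
$$

$$
f_c(\mathbf{u}_i \mid \mathbf{v}_i)
= \frac{E_s[(w_i - 1) \mid \mathbf{u}_i, \mathbf{v}_i]\, f_s(\mathbf{u}_i \mid \mathbf{v}_i)}{E_s[(w_i - 1) \mid \mathbf{v}_i]} \tag{2.8}
$$

(Equation (2.8) follows by application of (2.5a) to the 
second expression in (2.7)). Also, by (2.8) and the first 
equation in (2.7), 

$$
E_c(\mathbf{u}_i \mid \mathbf{v}_i)
= \frac{E_p[(1 - \pi_i)\mathbf{u}_i \mid \mathbf{v}_i]}{E_p[(1 - \pi_i) \mid \mathbf{v}_i]}
= \frac{E_s[(w_i - 1)\mathbf{u}_i \mid \mathbf{v}_i]}{E_s[(w_i - 1) \mid \mathbf{v}_i]}. \tag{2.9}
$$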

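Remark 2. In practical applications the sampling fraction is 
often very small and hence the sample selection probabilities 
are small for at least most of the population units. If 
$\pi_i < \delta$ with probability 1, 

$$
f_c(\mathbf{u}_i \mid \mathbf{v}_i)
= \frac{E_p[(1 - \pi_i) \mid \mathbf{u}_i, \mathbf{v}_i]\, f_p(\mathbf{u}_i \mid \mathbf{v}_i)}{E_p[(1 - \pi_i) \mid \mathbf{v}_i]}
= f_p(\mathbf{u}_i \mid \mathbf{v}_i) + \frac{[E_p(\pi_i \mid \mathbf{v}_i) - E_p(\pi_i \mid \mathbf{u}_i, \mathbf{v}_i)]\, f_p(\mathbf{u}_i \mid \mathbf{v}_i)}{E_p[(1 - \pi_i) \mid \mathbf{v}_i]}
= f_p(\mathbf{u}_i \mid \mathbf{v}_i)(1 + \Delta) \tag{2.10}
$$

where $-\delta < \Delta < \delta/(1 - \delta)$. It follows from (2.10) that for $\delta$ 
sufficiently small, the difference between the population pdf 
and the sample-complement pdf is accordingly small, which 
is not surprising. 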
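
To make (2.9) concrete for the case with no conditioning variables, the following sketch compares the actual mean of the non-sampled units with its sample-based counterpart, the ratio of sample sums of $(w_i - 1) y_i$ and $(w_i - 1)$. The population, the Poisson design with $\pi_i$ proportional to $y_i$, and the function name are illustrative assumptions, not taken from the paper.

```python
import random


def complement_mean(y_sample, w_sample):
    """Sample analogue of (2.9): sum((w-1)*y) / sum(w-1) estimates E_c(y)."""
    num = sum((w - 1.0) * y for w, y in zip(w_sample, y_sample))
    den = sum(w - 1.0 for w in w_sample)
    return num / den


rng = random.Random(79)
N = 20000
y_pop = [rng.uniform(1.0, 9.0) for _ in range(N)]
pi_pop = [0.05 * y for y in y_pop]  # assumed informative Poisson design
in_sample = [rng.random() < p for p in pi_pop]

y_s = [y for y, s in zip(y_pop, in_sample) if s]
w_s = [1.0 / p for p, s in zip(pi_pop, in_sample) if s]
y_c = [y for y, s in zip(y_pop, in_sample) if not s]

pop_mean = sum(y_pop) / N
true_c = sum(y_c) / len(y_c)       # mean of the units outside the sample
est_c = complement_mean(y_s, w_s)  # estimated from the sample alone

print(round(pop_mean, 2), round(true_c, 2), round(est_c, 2))
```

Because units with large y are more likely to be sampled here, the complement mean lies below the population mean, and the weights $(w_i - 1)$ account for that shift.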


3. OPTIMAL PREDICTION OF FINITE 
POPULATION TOTALS 


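Let $Y = \sum_{i=1}^{N} y_i$ define the population total. The problem 
considered is how to predict $Y$ based on the sample data and 
possibly population values of auxiliary variables. Denote the 
'design information' available for prediction by 
$D_s = \{(y_i, w_i), i \in s;\ (\mathbf{x}_j, I_j), j = 1, \ldots, N\}$ and let $\hat{Y} = \hat{Y}(D_s)$ 
define the predictor. The MSE of $\hat{Y}$ with respect to the 
population pdf given $D_s$ is, 

$$
\mathrm{MSE}(\hat{Y} \mid D_s) = E_p[(\hat{Y} - Y)^2 \mid D_s]
= E_p\{[\hat{Y} - E_p(Y \mid D_s) + E_p(Y \mid D_s) - Y]^2 \mid D_s\}
= [\hat{Y} - E_p(Y \mid D_s)]^2 + V_p(Y \mid D_s) \tag{3.1}
$$

since $[\hat{Y} - E_p(Y \mid D_s)]$ is fixed given $D_s$. It follows from 
(3.1) that $\mathrm{MSE}(\hat{Y} \mid D_s)$ is minimized when $\hat{Y} = E_p(Y \mid D_s)$. 
The latter expectation can be decomposed as, 

$$
E_p(Y \mid D_s) = \sum_{i=1}^{N} E_p(y_i \mid D_s)
= \sum_{i \in s} E_p(y_i \mid D_s, I_i = 1) + \sum_{j \notin s} E_p(y_j \mid D_s, I_j = 0)
= \sum_{i \in s} y_i + \sum_{j \notin s} E_c(y_j \mid \mathbf{x}_j) \tag{3.2}
$$

where in the last equality we assume that $y_j$ for $j \notin s$ and 
$D_s$ are uncorrelated given $\mathbf{x}_j$. The prediction problem 
reduces therefore to the estimation of the expectations 
$E_c(y_j \mid \mathbf{x}_j)$. In section 4 we consider semi-parametric 
estimation of these expectations. 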
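
For convenience, the expansion behind (3.1) can be written out in full. This is the standard bias-variance argument, added here as a sketch; $V_p(Y \mid D_s)$ denotes the conditional variance of $Y$ under the population pdf.

$$
\begin{aligned}
E_p[(\hat{Y} - Y)^2 \mid D_s]
&= E_p\{[(\hat{Y} - E_p(Y \mid D_s)) - (Y - E_p(Y \mid D_s))]^2 \mid D_s\} \\
&= [\hat{Y} - E_p(Y \mid D_s)]^2
 - 2[\hat{Y} - E_p(Y \mid D_s)]\, E_p[Y - E_p(Y \mid D_s) \mid D_s]
 + V_p(Y \mid D_s) \\
&= [\hat{Y} - E_p(Y \mid D_s)]^2 + V_p(Y \mid D_s).
\end{aligned}
$$

The cross term vanishes because $E_p[Y - E_p(Y \mid D_s) \mid D_s] = 0$, and the variance term does not depend on the predictor, so only the squared term can be reduced by the choice of $\hat{Y}$.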


4. SEMI-PARAMETRIC PREDICTION OF FINITE 
POPULATION TOTALS 


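Suppose that the sample-complement model takes the 
form, 

$$
y_j = C_\beta(\mathbf{x}_j) + \varepsilon_j, \quad
E_c(\varepsilon_j \mid \mathbf{x}_j) = 0, \quad
E_c(\varepsilon_j^2 \mid \mathbf{x}_j) = \sigma^2 v(\mathbf{x}_j), \quad
E_c(\varepsilon_j \varepsilon_k \mid \mathbf{x}_j, \mathbf{x}_k) = 0,\ k \neq j \tag{4.1}
$$

where $C_\beta(\mathbf{x})$ is a known (possibly nonlinear) function of $\mathbf{x}$ 
that depends on an unknown vector parameter $\beta$. The 
variances $\sigma^2 v(\mathbf{x}_j)$ are assumed known except for $\sigma^2$. 

Remark 3. In actual applications the model (4.1) can be 
identified by a two-step procedure, utilizing the equality 
$E_c(y_i \mid \mathbf{x}_i) = E_s(r_i y_i \mid \mathbf{x}_i)$ with $r_i = (w_i - 1)/E_s[(w_i - 1) \mid \mathbf{x}_i]$ 
(follows from Equation 2.9). First, estimate $E_s(w_i \mid \mathbf{x}_i)$ and 
hence $r_i$ by regressing $w_i$ against $\mathbf{x}_i$ using the sample data. 
Let $\hat{r}_i = (w_i - 1)/[\hat{E}_s(w_i \mid \mathbf{x}_i) - 1]$ and transform $y_i^* = \hat{r}_i y_i$. 
Second, study the relationship in the sample between $y_i^*$ 
and $\mathbf{x}_i$ for identifying the form of $C_\beta(\mathbf{x}_i)$. See 
Pfeffermann and Sverchkov (1999, 2003a) for examples of 
estimating $E_s(w_i \mid \mathbf{x}_i)$. A similar procedure can be applied 
for identifying the variance function $v(\mathbf{x}_i)$, using the 
empirical residuals $\hat{\varepsilon}_i = y_i - \hat{E}_s(\hat{r}_i y_i \mid \mathbf{x}_i)$. 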

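The function $C_\beta(\mathbf{x}_j)$ in (4.1) with the true vector 
parameter $\beta$ satisfies for all $j \notin s$, 

$$
C_\beta(\mathbf{x}_j)
= \arg\min_{C_{\tilde\beta}(\mathbf{x}_j)} E_c\left\{ \frac{[y_j - C_{\tilde\beta}(\mathbf{x}_j)]^2}{v(\mathbf{x}_j)} \,\Big|\, \mathbf{x}_j \right\}
= \arg\min_{C_{\tilde\beta}(\mathbf{x}_j)} E_s\left\{ r_j \frac{[y_j - C_{\tilde\beta}(\mathbf{x}_j)]^2}{v(\mathbf{x}_j)} \,\Big|\, \mathbf{x}_j \right\}. \tag{4.2}
$$

(The second equality follows from (2.9)). Hence, by 
substituting the sample expectation outside the curved 
brackets by the sample mean (a straightforward application 
of the method of moments) and estimating $r_i$ by $\hat{r}_i$ (see 
Remark 3), the vector $\beta$ can be estimated as, 

$$
\hat\beta_1 = \arg\min_{\tilde\beta} \sum_{i \in s} \hat{r}_i \frac{[y_i - C_{\tilde\beta}(\mathbf{x}_i)]^2}{v(\mathbf{x}_i)}. \tag{4.3}
$$

The predictor of the population total then takes the form, 

$$
\hat{Y}_1 = \sum_{i \in s} y_i + \sum_{j \notin s} C_{\hat\beta_1}(\mathbf{x}_j). \tag{4.4}
$$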


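Alternatively, it follows from (4.1) that, 

$$
\beta
= \arg\min_{\tilde\beta} E_c\left\{ \frac{[y_j - C_{\tilde\beta}(\mathbf{x}_j)]^2}{v(\mathbf{x}_j)} \,\Big|\, \mathbf{x}_j \right\}
= \arg\min_{\tilde\beta} E_c\left\{ \frac{[y_j - C_{\tilde\beta}(\mathbf{x}_j)]^2}{v(\mathbf{x}_j)} \right\}
= \arg\min_{\tilde\beta} E_s\left\{ \frac{w_j - 1}{E_s(w_j) - 1} \cdot \frac{[y_j - C_{\tilde\beta}(\mathbf{x}_j)]^2}{v(\mathbf{x}_j)} \right\} \tag{4.5}
$$

where the right hand side expectation is with respect to the 
joint distribution of $(y_j, \mathbf{x}_j)$. Thus, $\beta$ can be estimated as, 

$$
\hat\beta_2 = \arg\min_{\tilde\beta} \sum_{i \in s} (w_i - 1) \frac{[y_i - C_{\tilde\beta}(\mathbf{x}_i)]^2}{v(\mathbf{x}_i)} \tag{4.6}
$$

since $E_s(w_i)$ = constant. The predictor of $Y$ with $\beta$ 
estimated by $\hat\beta_2$ is therefore, 

$$
\hat{Y}_2 = \sum_{i \in s} y_i + \sum_{j \notin s} C_{\hat\beta_2}(\mathbf{x}_j). \tag{4.7}
$$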

Remark 4. A notable advantage of the use of the predictor 
$\hat{Y}_2$ over the use of the predictor $\hat{Y}_1$ is that it does not require 
the identification and estimation of the expectation 
$w(\mathbf{x}) = E_s(w \mid \mathbf{x})$. On the other hand, in situations where 
this expectation can be estimated properly, the predictor $\hat{Y}_1$ 
is likely to be more accurate since the weights 
$r_i = (w_i - 1)/[E_s(w_i \mid \mathbf{x}_i) - 1]$ will often be less variable than the 
weights $(w_i - 1)$. This is because the weights $r_i$ only 
account for the net effect of the sampling process on the 
target conditional distribution $f_c(y_i \mid \mathbf{x}_i)$, whereas the 
weights $(w_i - 1)$ account for the effect of the sampling 
process on the joint distribution $f_c(y_i, \mathbf{x}_i)$. In particular, 
when $w_i$ is a deterministic function of $\mathbf{x}_i$ such that 
$w_i = w(\mathbf{x}_i)$, the sampling process is noninformative and 
$f_c(y_i \mid \mathbf{x}_i) = f_s(y_i \mid \mathbf{x}_i) = f_p(y_i \mid \mathbf{x}_i)$. In this case the 
estimator $\hat\beta_1$ (but not $\hat\beta_2$) coincides with the optimal 
generalized least square (GLS) estimator of $\beta$ since $r_i = 1$ 
and the model (4.1) holds for the sample data. (For the data 
analysed in section 7, the empirical variance of the weights 
$r_i$ is 1.36, whereas the empirical variance of the weights $w_i$ 
is 2.66). In contrast to this, when the sampling weights $w_i$ 
are independent of $\mathbf{x}_i$, the estimates $\hat\beta_1$ and $\hat\beta_2$, and hence 
the predictors $\hat{Y}_1$ and $\hat{Y}_2$, are equal since $w(\mathbf{x}_i)$ = constant. 

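An interesting special case of the predictor $\hat{Y}_2$ arises 
when the working model postulated for the sample-complement 
is linear with an intercept term and constant 
variance. Let $\mathbf{x}_i' = (1, \tilde{\mathbf{x}}_i')$. As easily verified, the estimator 
in this case takes the form, 

$$
\hat{Y}_{2,\mathrm{Reg}} = \sum_{i \in s} y_i + \hat{Y}_c + (\tilde{X}_c - \hat{\tilde{X}}_c)' \hat{B}_c \tag{4.8}
$$

where $\tilde{X}_c = \sum_{j \notin s} \tilde{\mathbf{x}}_j$, 
$(\hat{Y}_c, \hat{\tilde{X}}_c) = [(N - n)/\sum_{i \in s}(w_i - 1)] \sum_{i \in s} (w_i - 1)(y_i, \tilde{\mathbf{x}}_i)$, 
and $\hat{B}_c$ is the probability weighted 
estimator of the vector coefficient of $\tilde{\mathbf{x}}_i$ but with the 
weights $(w_i - 1)$ instead of $w_i$. 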
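
The equivalence between the general form (4.7) and the closed form described around (4.8) can be illustrated for a linear working model with one auxiliary variable. The sketch below is a minimal illustration under that assumption; the function name, argument names and data are hypothetical. It fits the line by least squares with weights $(w_i - 1)$, sums the fitted values over the non-sampled units, and also evaluates the closed form built from the $(w_i - 1)$-weighted means.

```python
def reg2_predictor(y_s, x_s, w_s, x_nonsample):
    """Predict the total with a straight line fitted by (w_i - 1)-weighted
    least squares. Requires w_i > 1 for at least two distinct x values.
    Returns (direct form, closed form, (intercept, slope))."""
    a = [w - 1.0 for w in w_s]                     # weights (w_i - 1)
    sa = sum(a)
    xbar = sum(ai * xi for ai, xi in zip(a, x_s)) / sa
    ybar = sum(ai * yi for ai, yi in zip(a, y_s)) / sa
    sxx = sum(ai * (xi - xbar) ** 2 for ai, xi in zip(a, x_s))
    sxy = sum(ai * (xi - xbar) * (yi - ybar) for ai, xi, yi in zip(a, x_s, y_s))
    slope = sxy / sxx
    intercept = ybar - slope * xbar

    # Direct form: sample y-values plus fitted values outside the sample.
    direct = sum(y_s) + sum(intercept + slope * xj for xj in x_nonsample)

    # Closed form: weighted-mean prediction plus a regression adjustment.
    n_c = len(x_nonsample)                         # N - n
    y_c_hat = n_c * ybar
    x_c_hat = n_c * xbar
    closed = sum(y_s) + y_c_hat + (sum(x_nonsample) - x_c_hat) * slope
    return direct, closed, (intercept, slope)


x_s = [1.0, 2.0, 4.0, 7.0]
y_s = [5.5, 7.0, 15.0, 22.0]
w_s = [5.0, 3.0, 2.0, 1.5]
x_nonsample = [3.0, 5.0, 6.0]
direct, closed, coef = reg2_predictor(y_s, x_s, w_s, x_nonsample)
print(direct, closed, coef)
```

The two forms agree because the weighted residuals sum to zero when the model has an intercept, which is also the property used in the proof of Lemma 1.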


Remark 5. The predictor $\hat{Y}_{2,\mathrm{Reg}}$ can be obtained as a 
special case of the Cosmetic predictors proposed by Brewer 
(1999). It should be emphasized, however, that the 
development of the cosmetic predictors and the derivation 
of their MSE assume explicitly noninformative sampling. 
An important property of $\hat{Y}_{2,\mathrm{Reg}}$ is that under general 
conditions it is design consistent for $Y$, irrespective of the 
true sample-complement model (see Lemma 1 below). 
Many analysts view 'design consistency' as an essential 
requirement of any predictor; see the discussion in 
Hansen, Madow and Tepping (1983) and Särndal (1980). 
The following Lemma 1 defines conditions under which the 
more general predictor $\hat{Y}_2$ of (4.7) is design consistent for $Y$. 


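Lemma 1. The predictor $\hat{Y}_2$ is design consistent for $Y$ if the 
working model used for the computation of $\hat\beta_2$ satisfies the 
conditions, i- $C_\beta(\mathbf{x})$ has an intercept term, ii- $C_\beta(\mathbf{x})$ is 
differentiable with respect to $\beta$ in the neighbourhood of $\beta$, 
and iii- $v(\mathbf{x})$ = constant. 

Proof: By (4.6) and condition iii, 
$\hat\beta_2 = \arg\min_{\tilde\beta} \sum_{i \in s} (w_i - 1)[y_i - C_{\tilde\beta}(\mathbf{x}_i)]^2$, and by condition i, 
$C_\beta(\mathbf{x}) = \beta_0 + C_{1,\beta}(\mathbf{x})$, so that by condition ii, 
$\partial/\partial\beta_0 \{\sum_{i \in s} (w_i - 1)[y_i - C_\beta(\mathbf{x}_i)]^2\}|_{\beta = \hat\beta_2} = 0$, which implies 
$\sum_{i \in s} (w_i - 1)[y_i - C_{\hat\beta_2}(\mathbf{x}_i)] = 0$ or, 

$$
\sum_{i \in s} w_i y_i = \sum_{i \in s} y_i + \sum_{i \in s} w_i C_{\hat\beta_2}(\mathbf{x}_i) - \sum_{i \in s} C_{\hat\beta_2}(\mathbf{x}_i). \tag{4.9}
$$

The proof is completed by noting that under mild 
regularity conditions $\sum_{i \in s} w_i y_i$ is design consistent for $Y$, 
and $\sum_{i \in s} w_i C_{\hat\beta_2}(\mathbf{x}_i)$ is design consistent for $\sum_{i=1}^{N} C_{\hat\beta_2}(\mathbf{x}_i)$. 
Thus, the right hand side of (4.9) converges in probability to 
$\hat{Y}_2$ while the left hand side converges in probability to $Y$. 

It is important to emphasize again that the Lemma does 
not assume that the working model is the correct 
sample-complement model. 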


Survey Methodology, June 2004 


The use of the predictors $\hat{Y}_1$ and $\hat{Y}_2$ requires a 
specification of the sample-complement model. Next we 
develop another predictor that only requires the iden- 
tification and estimation of the sample model. The approach 
leading to this predictor is a sample-complement analogue 
of the “bias correction method’ proposed by Chambers et al. 
(2003). The proposed predictor is based on the following 
relationship, 


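$$
\begin{aligned}
\sum_{j \notin s} E_c(y_j \mid \mathbf{x}_j)
&= \sum_{j \notin s} E_s(y_j \mid \mathbf{x}_j)
 + (N - n) \frac{1}{N - n} \sum_{j \notin s} E_c\{[y_j - E_s(y_j \mid \mathbf{x}_j)] \mid \mathbf{x}_j\} \\
&\cong \sum_{j \notin s} E_s(y_j \mid \mathbf{x}_j) + (N - n)\, E_c[y_j - E_s(y_j \mid \mathbf{x}_j)]
\end{aligned} \tag{4.10}
$$

where in the second row we replaced the sample-complement 
average of the conditional expectations 
$E_c\{[y_j - E_s(y_j \mid \mathbf{x}_j)] \mid \mathbf{x}_j\}$ by its expectation over the sample-complement 
distribution of the x-values ($n$ denotes the sample size). By 
(2.9), 

$$
E_c[y_j - E_s(y_j \mid \mathbf{x}_j)]
= E_s\left\{ \frac{w_j - 1}{E_s(w_j) - 1} [y_j - E_s(y_j \mid \mathbf{x}_j)] \right\} \tag{4.11}
$$

implying that the sample-complement mean in the second 
row of (4.10) can be estimated as 
$\hat{M}_c = (1/n) \sum_{i \in s} \{[(w_i - 1)/(\bar{w}_s - 1)][y_i - \hat{E}_s(y_i \mid \mathbf{x}_i)]\}$, where 
$\bar{w}_s = \sum_{i \in s} w_i / n$. The proposed predictor therefore takes the form, 

$$
\hat{Y}_3 = \sum_{i \in s} y_i + \sum_{j \notin s} \hat{E}_s(y_j \mid \mathbf{x}_j) + (N - n)\hat{M}_c \tag{4.12}
$$

with $E_s(y_j \mid \mathbf{x}_j)$ estimated from the sample data. The use 
of $\hat{Y}_3$ only requires the identification and estimation of the 
sample regression $E_s(y_i \mid \mathbf{x}_i)$, which can be carried out 
using conventional regression techniques. Moreover, under 
mild conditions $\hat{Y}_3$ is design consistent for $Y$ even if the 
expectation $E_s(y_i \mid \mathbf{x}_i)$ is misspecified. This property 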
follows from ihe fact that > PN Be (y Ix, 7) is design 
consistent for Di jesE,(y,;!x,) and (N —n) M. is design 
consistent for M. = Dies Ly; —E,(y, x ,)]. 
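
A minimal sketch of the bias-corrected predictor described in (4.10)-(4.12), assuming a straight-line sample regression with one auxiliary variable fitted by ordinary least squares. The function names and data are hypothetical; the correction term follows the verbal definition of the estimated sample-complement mean of residuals, with weights $(w_i - 1)/(\bar{w}_s - 1)$.

```python
def ols_fit(x, y):
    """Ordinary least squares for y = b0 + b1 * x (one regressor)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1


def bias_corrected_predictor(y_s, x_s, w_s, x_nonsample):
    """Sample total + OLS predictions outside the sample + (N - n) * M_c,
    where M_c is the weighted mean of the sample residuals.
    Assumes the mean weight exceeds 1 (the sample is not a census)."""
    n = len(y_s)
    n_c = len(x_nonsample)                         # N - n
    b0, b1 = ols_fit(x_s, y_s)
    w_bar = sum(w_s) / n
    m_c = sum(
        (w - 1.0) / (w_bar - 1.0) * (y - (b0 + b1 * x))
        for w, x, y in zip(w_s, x_s, y_s)
    ) / n
    fitted_outside = sum(b0 + b1 * xj for xj in x_nonsample)
    return sum(y_s) + fitted_outside + n_c * m_c


x_s = [1.0, 2.0, 4.0, 7.0]
y_s = [5.5, 7.0, 15.0, 22.0]
w_s = [5.0, 3.0, 2.0, 1.5]
x_nonsample = [3.0, 5.0, 6.0]
y3 = bias_corrected_predictor(y_s, x_s, w_s, x_nonsample)
print(y3)
```

When all weights are equal, the correction term is proportional to the sum of the OLS residuals and therefore vanishes, so the predictor reduces to the usual regression prediction.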

Remark 6. If the model fitted to the sample data is linear 
regression with an intercept and constant residual variance, 
the difference between the predictor $\hat{Y}_{2,\mathrm{Reg}}$ defined by (4.8) 
and the predictor $\hat{Y}_3$ is that $\hat{Y}_{2,\mathrm{Reg}}$ uses a consistent 
estimator for the regression coefficients defining the linear 
approximation to the model holding for the sample-complement, 
whereas in $\hat{Y}_3$ the regression coefficients are 
estimated by ordinary least squares (OLS), thus estimating 
the linear approximation to the sample model. 

Finally, rather than only predicting the sample- 
complement values as with the previous predictors, one 
could instead predict all the population values by their 
estimated expectations under the population model. Assum- 
ing that the latter model is linear regression with an intercept 
term and constant residual variance, application of (2.5b) 
yields, 


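$$
B = \arg\min_{\tilde{B}} E_p(y_i - \mathbf{x}_i'\tilde{B})^2
= \arg\min_{\tilde{B}} \frac{E_s[w_i (y_i - \mathbf{x}_i'\tilde{B})^2]}{E_s(w_i)}. \tag{4.13}
$$

Estimating the sample expectation in the numerator of 
(4.13) by the corresponding sample mean (application of the 
method of moments) and minimizing the sample mean with 
respect to $\tilde{B}$ yields the familiar probability weighted 
estimator $\hat{B}_{pw} = (X_s' W_s X_s)^{-1} X_s' W_s Y_s$, where 
$X_s = [\mathbf{x}_1, \ldots, \mathbf{x}_n]'$, $Y_s = (y_1, \ldots, y_n)'$ and $W_s = \mathrm{Diag}[w_1, \ldots, w_n]$. 
Let $\mathbf{x}_i' = (1, \tilde{\mathbf{x}}_i')$. Estimating $E_p(y_i \mid \mathbf{x}_i)$ by 
$\mathbf{x}_i'\hat{B}_{pw} = \hat{B}_{0,pw} + \tilde{\mathbf{x}}_i'\hat{\tilde{B}}_{pw}$ and summing over all the population units 
yields the familiar generalized regression (GREG) estimator 
(Särndal 1980), 

$$
\hat{Y}_{\mathrm{GREG}} = N \left[ \frac{\sum_{i \in s} w_i y_i}{\sum_{i \in s} w_i}
+ \hat{\tilde{B}}_{pw}' \left( \bar{\tilde{X}} - \frac{\sum_{i \in s} w_i \tilde{\mathbf{x}}_i}{\sum_{i \in s} w_i} \right) \right], \tag{4.14}
$$

where $\bar{\tilde{X}}$ denotes the population mean of the $\tilde{\mathbf{x}}_i$. 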
Remark 7. By considering the estimation of $Y$ as a 
prediction problem, the use of the predictor $\hat{Y}_{2,\mathrm{Reg}}$ in (4.8) 
requires the prediction of $(N - n)$ values whereas the use of 
the GREG requires the prediction of $N$ values. Hence, in 
situations where both the sample-complement model and 
the population model can be approximated fairly well by 
linear regression models with intercept terms (but possibly 
with different vectors of coefficients for the two models), 
one expects that for sufficiently large sampling fractions 
$n/N$ the predictor $\hat{Y}_{2,\mathrm{Reg}}$ will be superior (see the 
empirical results in section 7). 
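
For comparison with the sample-complement predictors, the GREG estimator with an intercept and one auxiliary variable can be sketched as follows. The function name and data are hypothetical, and the single-regressor form is an assumption made to keep the example short: the slope is the probability weighted least squares slope, and the estimator is N times the weighted mean of y adjusted for the gap between the population and weighted sample means of x.

```python
def greg_estimator(y_s, x_s, w_s, N, x_pop_mean):
    """GREG with intercept and one auxiliary variable:
    N * [ybar_w + slope_pw * (Xbar - xbar_w)]."""
    sw = sum(w_s)
    ybar_w = sum(w * y for w, y in zip(w_s, y_s)) / sw
    xbar_w = sum(w * x for w, x in zip(w_s, x_s)) / sw
    sxx = sum(w * (x - xbar_w) ** 2 for w, x in zip(w_s, x_s))
    sxy = sum(w * (x - xbar_w) * (y - ybar_w) for w, x, y in zip(w_s, x_s, y_s))
    slope = sxy / sxx
    return N * (ybar_w + slope * (x_pop_mean - xbar_w))


x_pop = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
x_s = [1.0, 2.0, 4.0, 7.0]
y_s = [5.5, 7.0, 15.0, 22.0]
w_s = [2.5, 2.0, 1.5, 1.0]
greg = greg_estimator(y_s, x_s, w_s, N=len(x_pop), x_pop_mean=sum(x_pop) / len(x_pop))
print(greg)
```

Unlike the predictors of the previous sketches, this estimator does not keep the observed sample y-values as they are; all N values are replaced by fitted values, which is the point made in Remark 7.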


5. EXAMPLES 
5.1 Prediction with No Concomitant Variables 


Let $\mathbf{x}_i = 1$ for all $i$. By (3.2) and (2.9), 

$$
E_p(Y \mid D_s) = \sum_{i \in s} y_i + \sum_{j \notin s} E_c(y_j)
= \sum_{i \in s} y_i + (N - n)\, E_s\left\{ \frac{(w_i - 1)\, y_i}{E_s(w_i) - 1} \right\}. \tag{5.1}
$$


Estimating the two sample expectations in the right hand 
side of (5.1) by the respective sample means yields the 
estimator, 


=: 


ies i€s 


= hase We Wye 
2* a es Yi; 
In (5.2), $\sum_{i \in s}(w_i - 1) y_i$ is a 'Horvitz-Thompson 
estimator' of $\sum_{j \notin s} y_j$. The multiplier $(N - n)/\sum_{i \in s}(w_i - 1)$ 
is a 'Hajek type correction' for controlling the variability of 
the sampling weights. Notice that i. is a special case of 
the predictor et Reg defined in (4.8), obtained by setting 
x; =1 for all. It is also a special case of the predictor Y, if 
one estimates EG =y=).,y,/n. For sampling 
designs such that ¥'-,w; = N for all s, or if one estimates 
E.(w,) =N/n, the predictor Y zy reduces to the familiar 
Horvitz-Thompson estimator of the population total, 
or ele a 
As with the GREG estimator considered in section 4, 
rather than predicting the sample-complement total Y, = 
Djes ¥; and using the predictor Y zr» One could predict all 
the population y-values by estimating their expectations 
under the population model. By (2.5b), £,(y,;)= 
E(w, y;)/ E,(w;). Estimating the two sample expectations 
by the corresponding sample means yields the familiar 
Hajek estimator, 
Wi Yi 
E , (w; ) 


Here again, we anticipate Y zy to be more precise than 
Ya as the sampling fraction increases (see also the 
empirical results in section 7). Note that Vs and Gee are 
the same and coincide with the Kee Het oh 
estimator for sampling designs satisfying >; =: 


(5.2) 


Visser pe Ey (y,) =NE, ; 


ies Wi ies om 


= wy; 


ies W 
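
The three estimators discussed in this subsection can be compared directly. The sketch below uses hypothetical function names and data: it implements the predictor obtained from (5.1) by replacing the sample expectations with sample means (the sample total plus the Hajek-corrected estimate of the non-sampled total), the Horvitz-Thompson estimator and the Hajek estimator, and shows that they coincide when the sampling weights sum to N.

```python
def complement_predictor(y_s, w_s, N):
    """Sample total plus (N - n) * sum((w-1)*y) / sum(w-1)."""
    n = len(y_s)
    num = sum((w - 1.0) * y for w, y in zip(w_s, y_s))
    den = sum(w - 1.0 for w in w_s)
    return sum(y_s) + (N - n) * num / den


def horvitz_thompson(y_s, w_s):
    """Horvitz-Thompson estimator of the total: sum(w * y)."""
    return sum(w * y for w, y in zip(w_s, y_s))


def hajek(y_s, w_s, N):
    """Hajek estimator of the total: N * sum(w * y) / sum(w)."""
    return N * sum(w * y for w, y in zip(w_s, y_s)) / sum(w_s)


N = 12
y_s = [3.0, 8.0, 5.0, 12.0]
w_fixed = [2.0, 4.0, 3.0, 3.0]    # weights sum to N
w_other = [2.0, 5.0, 3.0, 3.0]    # weights sum to 13, not N

same = (complement_predictor(y_s, w_fixed, N),
        horvitz_thompson(y_s, w_fixed),
        hajek(y_s, w_fixed, N))
differ = (complement_predictor(y_s, w_other, N),
          horvitz_thompson(y_s, w_other),
          hajek(y_s, w_other, N))
print(same, differ)
```

With `w_other` the three values separate, which is the situation in which the choice between predicting only the non-sampled values and predicting all N values matters.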


5.2 Optimal Prediction with Concomitant Variables, 
Comparison with Optimal Predictors Under 
Noninformative Sampling 

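Let the population model be, 

$$
y_i = H_\beta(\mathbf{x}_i) + \varepsilon_i, \quad
E_p(\varepsilon_i \mid \mathbf{x}_i) = 0, \quad
E_p(\varepsilon_i^2 \mid \mathbf{x}_i) = v(\mathbf{x}_i), \quad
E_p(\varepsilon_i \varepsilon_j \mid \mathbf{x}_i, \mathbf{x}_j) = 0,\ i \neq j \tag{5.4}
$$

and suppose that the sample inclusion probabilities can be 
modeled as, 

$$
\pi_i = K \times [y_i\, g(\mathbf{x}_i) + \delta_i], \quad
E_p(\delta_i \mid \mathbf{x}_i, y_i) = 0 \tag{5.5}
$$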


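where $H_\beta(\mathbf{x})$, $v(\mathbf{x})$ and $g(\mathbf{x})$ are positive functions and $K$ 
is a normalizing constant. (Below we consider the special 
case of 'regression through the origin'). This sampling 
scheme is considered for illustration only, although in 
section 2 we mention several practical situations where the 
sample selection probabilities depend directly on the y and 
x-values. In particular, this is the case with the data set 
analysed in section 7. Under (5.4) and (5.5), 
$\pi(\mathbf{x}_i) = E_p(\pi_i \mid \mathbf{x}_i) = K H_\beta(\mathbf{x}_i) g(\mathbf{x}_i)$. Hence, by (2.9), (5.4) and 
(5.5), 

$$
\begin{aligned}
E_c(y_i \mid \mathbf{x}_i)
&= E_p\left[ \frac{1 - \pi_i}{1 - \pi(\mathbf{x}_i)}\, y_i \,\Big|\, \mathbf{x}_i \right]
 = E_p\left[ \frac{1 - \pi(\mathbf{x}_i) - K\varepsilon_i g(\mathbf{x}_i) - K\delta_i}{1 - \pi(\mathbf{x}_i)}\, y_i \,\Big|\, \mathbf{x}_i \right] \\
&= E_p(y_i \mid \mathbf{x}_i) - \frac{K g(\mathbf{x}_i) v(\mathbf{x}_i)}{1 - \pi(\mathbf{x}_i)}.
\end{aligned} \tag{5.6}
$$

The last expression in (5.6) shows that $E_c(y_i \mid \mathbf{x}_i) < 
E_p(y_i \mid \mathbf{x}_i) = H_\beta(\mathbf{x}_i)$, which is clear since for the 
inclusion probabilities defined by (5.5), the sample-complement 
tends to include the units with the smaller 
y-values for any given x-values. Note, however, that as 
$n/N \to 0$, $K \to 0$ and $E_c(y_i \mid \mathbf{x}_i) \to E_p(y_i \mid \mathbf{x}_i)$ 
(see Remark 2). 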
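
The last equality in (5.6) rests on two conditional moments, written out here as a sketch under the assumptions of (5.4) and (5.5). Writing $y_i = H_\beta(\mathbf{x}_i) + \varepsilon_i$ and $\pi_i = \pi(\mathbf{x}_i) + K\varepsilon_i g(\mathbf{x}_i) + K\delta_i$,

$$
\begin{aligned}
E_p[(1 - \pi_i)\, y_i \mid \mathbf{x}_i]
&= [1 - \pi(\mathbf{x}_i)]\, H_\beta(\mathbf{x}_i)
 - K g(\mathbf{x}_i)\, E_p(\varepsilon_i y_i \mid \mathbf{x}_i)
 - K\, E_p(\delta_i y_i \mid \mathbf{x}_i), \\
E_p(\varepsilon_i y_i \mid \mathbf{x}_i)
&= H_\beta(\mathbf{x}_i)\, E_p(\varepsilon_i \mid \mathbf{x}_i) + E_p(\varepsilon_i^2 \mid \mathbf{x}_i) = v(\mathbf{x}_i), \\
E_p(\delta_i y_i \mid \mathbf{x}_i)
&= E_p[\, y_i\, E_p(\delta_i \mid \mathbf{x}_i, y_i) \mid \mathbf{x}_i] = 0.
\end{aligned}
$$

Dividing the first line by $1 - \pi(\mathbf{x}_i)$ gives $H_\beta(\mathbf{x}_i) - K g(\mathbf{x}_i) v(\mathbf{x}_i)/[1 - \pi(\mathbf{x}_i)]$, which is the last expression in (5.6).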

As a special case of (5.4), consider the case of a single 
auxiliary variable $x$ and let $H_\beta(x) = x\beta$ and $v(x) = \sigma^2 x$ 
('regression through the origin with variance proportional to 
x'). For noninformative sampling and known $\beta$, the optimal 
unbiased predictor of $Y$ minimizing $E_p[(\hat{Y} - Y)^2 \mid D_s]$ is in 
this case, $\hat{Y} = \sum_{i \in s} y_i + \beta \sum_{j \notin s} x_j$. In the practical case of 
unknown $\beta$, the optimal unbiased predictor of $Y$ is the 
familiar Ratio estimator $N \bar{y} (\bar{X}/\bar{x})$, with $\bar{y}$ denoting 
the sample mean of $y$ and $(\bar{x}, \bar{X})$ denoting the sample and 
population means of $x$ (Brewer 1963, Royall 1970). 

Now let $g(x) = 1$ in (5.5) for all $x$, so that 
$\pi_i = n [\sum_{j=1}^{N} (y_j + \delta_j)]^{-1} (y_i + \delta_i)$. For sufficiently large $N$, we can 
approximate $\pi_i \cong n(y_i + \delta_i)/(N\beta\bar{X})$, implying that 
$\pi(x_i) = E_p(\pi_i \mid x_i) = n x_i/(N\bar{X})$. By (5.6), 
$E_c(y_i \mid x_i) = x_i\beta - \sigma^2 x_i/[\beta(f^{-1}\bar{X} - x_i)]$, where $f = n/N$ is the 
sampling fraction, so that for known $\beta$ and $\sigma^2$ the optimal 
predictor of $Y$ is, 


g(x; )+8.], 


(5.6) 


x; 
ere 
Tere 


Lemma 2: Let the population model be defined by (5.4) 
with $H_\beta(x) = x\beta$ and $v(x) = \sigma^2 x$. Assume also 

=) +B x; - 


ies J€s 


Yr, Regma 
$E_p(\varepsilon_i^3 \mid x_i) = 0$. Suppose that the sample units are selected 
independently with probabilities defined as in (5.5), with 
$g(x) \equiv 1$. Then 


MSE, (Y,. Reg D,)= 
Bays a ASHBY eS Tie y Gf & ,)7 1B) 


Proof : By the independence of the population values and of 
the sample selections, 


MSE aa a D,) 
ssl oer Reg 


=D jer Be (ly; —E.(y,] 2 JP | ;}. 


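By (5.6), $[y_j - E_c(y_j \mid x_j)] = [\varepsilon_j + \kappa_j/(1 - \pi(x_j))]$, 
where $\kappa_j = K\sigma^2 x_j$, $K = n/(\beta N \bar{X})$ and $\pi(x_j) = E_p(\pi_j \mid x_j) = 
n x_j/(N\bar{X})$. Hence, 

$$
E_c\{[y_j - E_c(y_j \mid x_j)]^2 \mid x_j\}
= E_c(\varepsilon_j^2 \mid x_j) + \frac{2\kappa_j}{1 - \pi(x_j)}\, E_c(\varepsilon_j \mid x_j)
+ \left[\frac{\kappa_j}{1 - \pi(x_j)}\right]^2.
$$

Now, 

$$
E_c(\varepsilon_j^2 \mid x_j)
= E_p\left[ \frac{1 - \pi_j}{1 - \pi(x_j)}\, \varepsilon_j^2 \,\Big|\, x_j \right]
= E_p\left[ \frac{1 - \pi(x_j) - K\varepsilon_j - K\delta_j}{1 - \pi(x_j)}\, \varepsilon_j^2 \,\Big|\, x_j \right]
= \sigma^2 x_j
$$

and 

$$
E_c(\varepsilon_j \mid x_j)
= E_p\left[ \frac{1 - \pi(x_j) - K\varepsilon_j - K\delta_j}{1 - \pi(x_j)}\, \varepsilon_j \,\Big|\, x_j \right]
= -\frac{\kappa_j}{1 - \pi(x_j)}.
$$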

It follows therefore that MSE WY E, 
Djeslx;/U-a(x,))]?. QE. D. 


Remark 8: For noninformative sampling and with known 
$\beta$, the prediction MSE of the optimal predictor 
$\hat{Y} = \sum_{i \in s} y_i + \beta \sum_{j \notin s} x_j$ is $\mathrm{MSE}_p(\hat{Y} \mid D_s) = \sigma^2 \sum_{j \notin s} x_j$. This 
MSE is larger than the MSE obtained under the informative 
sampling scheme defined by the Lemma, which is obvious 
since the latter scheme tends to sample the units with the 
larger y-values and hence also with the larger x-values and 
the larger standard deviations. 


rep |D,) = 9° dijes% jus 


6. MEAN SQUARE ERROR ESTIMATION 


Estimating $\mathrm{MSE}(\hat{Y} \mid D_s) = E_p[(\hat{Y} - Y)^2 \mid D_s]$ for the 
predictors $\hat{Y}$ considered in section 4 requires strict model 
assumptions that could be hard to validate. This is largely 
due to the conditioning on the design information $D_s$. In 
order to deal with this problem, we propose to estimate 
instead the unconditional MSE, MSE(Y) = = El(Y - Y)*]= 
E, {E, (Y - Y)’ ID ;]}, where E, =E,£E, defines the 
expectation over the sample disttibution ingn the selected 
sample) and over all possible sample selections. Notice that 
E »UY — Y)’ ID ,] can be viewed as a random variable 
Wiis ), so that MSE(Y) = E, (u(D,)] defines its ‘best 
predictor’ with respect to the mean square loss function 
under the distribution f, over which the expectation Ep 

is taken. By changing the order of the expectations, the 
unconditional MSE can be expressed as, 


MSE(Y) = E,E,Epl -Y)"|y] 


= E,E,[(¥ -Y)’|y] (6.1) 


where $\mathbf{y} = \{y_i, i \in U\}$. Estimating the unconditional MSE 
of any of the predictors $\hat{Y}$ can be carried out therefore by 
estimating its randomization MSE; see Pfeffermann (1993) 
of the various predictors has the additional advantage of 
allowing their use under the design based approach. 
Estimation of randomization variances of design based 
estimators is considered extensively in the literature and 
many diverse methods are in routine use. However, in view 
of the complicated structure of some of the predictors 
considered in this study and in order not to restrict to 
particular sampling schemes, we propose below the use of a 
two-step procedure that combines an inverse sampling 
process (Step 1) and what can be viewed as a bootstrap 
resampling algorithm (Step 2). A notable advantage of this 
procedure is that it is general and applies ‘equally’ to all the 
predictors. Also, unlike other variance estimation methods 
in common use, it does not require knowledge of the pairwise
joint selection probabilities $\pi_{ij} = \Pr(i, j \in s)$. As
discussed later, a valid application of the first step requires 
sufficiently large samples. The two steps of the proposed 
procedure are as follows: 
Step 1- Generate a single ‘pseudo population’ by selecting
with replacement $N$ units from the original sample with
probabilities proportional to $w_i = 1/\pi_i$, where $N$ is the
population size. The justification for this step is given
below, see also Remark 10. Denote by $Y_{pp}$ the sum of the y-
values in the pseudo population.


Step 2- Select independently a large number B of bootstrap 
samples from the pseudo population generated in Step 1, 
using the same sampling scheme as used for the selection of 
the original sample, and re-estimate the population total. 

Let $\hat{Y}$ represent any of the predictors and denote the
predictor obtained for bootstrap sample $b$ by $\hat{Y}_b$. Estimate,


$\widehat{\mathrm{MSE}}(\hat{Y}) = \frac{1}{B} \sum_{b=1}^{B} (\hat{Y}_b - Y_{pp})^2$.   (6.2)
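
The two steps above can be sketched in code. The following is only a minimal illustration under stated assumptions, not the implementation used in the study: the sampling scheme is taken to be systematic PPS from a randomly ordered list with no certainty units, the size measure carried by each pseudo population unit is assumed to be its original selection probability $1/w_i$, and the function names (`systematic_pps`, `horvitz_thompson`, `bootstrap_mse`) are hypothetical.

```python
import numpy as np


def systematic_pps(z, n, rng):
    """Systematic PPS sample of n units from a randomly ordered list.

    Assumes n * z.max() <= z.sum(), i.e., no certainty units.
    Returns the selected indices and their inclusion probabilities.
    """
    N = len(z)
    order = rng.permutation(N)               # random ordering of the list
    pi = n * z / z.sum()                     # inclusion probabilities (sum to n)
    cum = np.cumsum(pi[order])
    points = rng.uniform(0.0, 1.0) + np.arange(n)   # random start, unit step
    pos = np.minimum(np.searchsorted(cum, points, side="right"), N - 1)
    picked = order[pos]
    return picked, pi[picked]


def horvitz_thompson(y, pi):
    """Example predictor: the Horvitz-Thompson estimator of the total."""
    return np.sum(y / pi)


def bootstrap_mse(y_s, w_s, N, predictor, B=200, seed=0):
    """Two-step MSE estimate in the spirit of (6.2)."""
    rng = np.random.default_rng(seed)
    n = len(y_s)
    # Step 1: pseudo population of size N, drawn with replacement
    # from the sample with probabilities proportional to w_i.
    draw = rng.choice(n, size=N, replace=True, p=w_s / w_s.sum())
    y_pp = y_s[draw]
    z_pp = 1.0 / w_s[draw]                   # assumed size measure (original pi_i)
    total_pp = y_pp.sum()                    # pseudo population total
    # Step 2: B bootstrap samples with the same scheme; re-estimate the total.
    sq_err = np.empty(B)
    for b in range(B):
        idx, pi_b = systematic_pps(z_pp, n, rng)
        sq_err[b] = (predictor(y_pp[idx], pi_b) - total_pp) ** 2
    return sq_err.mean()
```

Any other predictor with the signature `predictor(y, pi)` can be passed in place of `horvitz_thompson`; predictors that use auxiliary values would need the x-values carried along with each pseudo population unit in the same way.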
The performance of the estimator (6.2) in estimating the
randomization MSE depends obviously on the ‘closeness’
of the pseudo population generated in Step 1 to the actual
population from which the original sample was drawn. The
closeness of the two populations can be verified in part by
noting that the marginal distribution of $y_i | x_i$ in the pseudo
population is the same as in the original population. To see
this, note that the pseudo population generated in Step 1 is a
‘sample with replacement’ from the original sample with
selection probabilities $C w_i$ on each draw, where
$C = 1/\sum_{i \in s} w_i$. Denoting by $f_{pp}(y_i | x_i)$ the marginal
pseudo population distribution we find using (2.2) and
(2.5a),


$f_{pp}(y_i | x_i) = \frac{E_s(C w_i | y_i, x_i) f_s(y_i | x_i)}{E_s(C w_i | x_i)} = \frac{E_p(\pi_i | x_i) f_s(y_i | x_i)}{E_p(\pi_i | y_i, x_i)} = f_p(y_i | x_i)$.   (6.3)

Remark 9. Equation (6.3) only refers to the marginal
distribution of $y_i | x_i$. As with the standard bootstrap
method, a successful application of the proposed procedure
requires that the original sample size is sufficiently large and
that the sample measurements are approximately independent.
Pfeffermann et al. (1998) establish conditions
under which for independent population measurements the
sample measurements are ‘asymptotically independent’
under commonly used sampling schemes with unequal
selection probabilities.


Remark 10. Step 1 is similar and asymptotically equivalent
to duplicating sample unit $i$ $w_i$ times. Notice, however, that
the use of this duplication procedure does not yield pseudo
populations of size $N$ unless $\sum_{i \in s} w_i = N$. It is also not clear
how to establish the relationship (6.3) when using this
procedure.


7. EMPIRICAL ILLUSTRATIONS 
7.1 Description of Empirical Study 


In order to illustrate the performance of the predictors 
and the associated MSE estimates discussed in previous 
sections we use a real data set, collected as part of the 1988 
U.S. National Maternal and Infant Health Survey. The 
survey uses a disproportionate stratified random sample of 
vital records with the strata defined by mother’s race and 
child’s birth weight; see Korn and Graubard (1995) for 
details. For the empirical study in this section we considered 
the sample data as ‘population’ and selected independently
1,000 samples with probabilities proportional to the inverse
of the original sampling weights, using a systematic PPS 
sampling scheme. The list of ‘population units’ was 
randomly ordered before every sample selection. For each 
sample we predicted the population total of birth weight 
(measured in grams, divided by 10,000 in the present 
study), using gestational age as the auxiliary variable 
(measured in weeks). The sample inclusion probabilities 
depend therefore on the values of the study variable that 
defines the original strata. Notice that although the original 
sample was supposedly a stratified random sample, the 
sampling weights actually vary within the strata, which is 
why we used systematic PPS sampling for the simulation 
study. We considered three different sample sizes, n = 232,
1,145, 2,429. The ‘population’ (original sample) size is
N = 9,948. (For n = 232, $0.002 < \pi_i = \Pr(i \in s) < 0.15$. For
n = 1,145, $0.01 < \pi_i < 0.735$. For n = 2,429,
$\pi_i < 0.99$ with mean $\bar{\pi} = 0.26$ and standard deviation
$\mathrm{Std}(\pi_i) = 0.29$. In the latter case some of the units were
drawn almost with certainty).

Some of the predictors considered for this study (see 
below) require the specification of either the sample model 
or the sample-complement model. We assumed for both 
models the third order polynomial regression, 

$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \varepsilon_i$,   (7.1)
with independent residuals and constant variance. This 
model was found by Pfeffermann and Sverchkov (1999) to 
give a good fit to the ‘population’ (original sample) data 
with $R^2 = 0.61$ (see Figure 1), and it was found also to fit
fairly well the sample data (with different coefficients) for 
several samples selected from this ‘population’. Notice, on 
the other hand, that with this strongly informative sampling 
scheme, it is unlikely that the sample model, the population 
model and the sample-complement model are all from the 
same family even if with different parameters. The present 
study enables therefore studying the performance of the 
various predictors when some or all of the three models are 
misspecified. This important robustness question is further 
examined by fitting simple regression models instead of the 
third order polynomial regressions, that is, by omitting the
second and third powers of the auxiliary variable. The only
exception is the model dependent predictor defined by Equation
(4.4), for which no coherent estimator of the expectation
$E_s(w_i | x_i)$ could be found when restricting to simple
regression. (The method considered in Pfeffermann and
Sverchkov (1999) for the estimation of this expectation 
assumes normality of the population model residuals. This 
is a valid assumption when fitting the third order polynomial 
regression model but is clearly violated when dropping the 
second and third powers of the auxiliary variable). 

Figure 1. Scatterplot of Birth Weight against Gestational Age in ‘Population’ (Original Sample), and Predicted Values
Under 3rd Order Polynomial Regression. U.S. National Maternal and Infant Health Survey, 1988.
Model fitted: $y_i = 17886 - 1827.7 x_i + 61.2 x_i^2 - 0.61 x_i^3 + \varepsilon_i$, $\mathrm{Var}(\varepsilon_i) = 603.2$, $R^2 = 0.61$.



The predictors considered for this study divide therefore
into three groups. The first group consists of predictors that
only use the sample y-values and the sampling weights.
Included in this group are the Horvitz-Thompson estimator
$\hat{Y}_{HT} = \sum_{i \in s} w_i y_i$, the predictor defined by (5.2) and
Hajek’s estimator defined by (5.3). The second group
consists of predictors that use the working model defined by
(7.1). Included in this group are the two regression predictors
defined by (4.4) and (4.8) respectively,
the bias corrected predictor defined by (4.12) and
the GREG estimator defined by (4.14). The third group
contains the same predictors as the second group (except for
the predictor defined by (4.4), see above), but based on the
simple regression model (only the first power of x).

The MSEs of all the predictors considered in this study 
have been estimated by use of the two-step procedure 
described in section 6. However, because of computing time 
limitations, the MSE estimators were only computed for a 
random selection of 200 out of the 1,000 samples and are 
based on only 200 bootstrap samples from each pseudo 
population. For assessing the performance of the MSE 
estimators we computed the corresponding empirical MSEs 
based on the 1,000 samples selected from the study 
population. Thus, the ‘true’ MSE of a generic predictor $\hat{Y}$
was computed as,


$\mathrm{MSE}(\hat{Y}) = \frac{1}{1{,}000} \sum_{r=1}^{1{,}000} (\hat{Y}_{(r)} - Y)^2$,   (7.2)


where $\hat{Y}_{(r)}$ denotes the predictor computed from the $r$-th
sample. Notice that since the population values are fixed,
the MSE in (7.2) is the randomization MSE over all possible 
sample selections, which is what the estimator (6.2) is 
intended to estimate. 


7.2 Results of Empirical Study 


The main results of this study are exhibited in
Tables 1.1 to 1.3 (one table for each sample size). The third
column of each table shows for every predictor $\hat{Y}$ the
empirical bias, $[(\sum_{r=1}^{R} \hat{Y}_{(r)}/R) - Y]$, and the standard
deviation (Std) of the empirical bias, computed from the
variation of the $\hat{Y}_{(r)}$ over the $R = 1,000$ samples.
The next two columns show respectively the ‘true’
(empirical) RMSE (square root of Equation 7.2), and the
square root of the mean of the corresponding bootstrap
estimators defined by (6.2).
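
The summary measures reported in the tables can be illustrated as follows. This is a minimal sketch, not the code used for the study; the function name `summarize_predictor` is hypothetical, and the standard deviation of the empirical bias is assumed here to be the usual standard error of a mean (sample standard deviation with divisor $R - 1$, divided by $\sqrt{R}$).

```python
import numpy as np


def summarize_predictor(estimates, true_total):
    """Empirical bias, Std of the bias and RMSE over R simulated samples."""
    estimates = np.asarray(estimates, dtype=float)
    R = estimates.size
    bias = estimates.mean() - true_total
    # Assumption: Std of the empirical bias = standard error of the mean.
    std_bias = estimates.std(ddof=1) / np.sqrt(R)
    # Square root of the empirical MSE, as in Equation (7.2).
    rmse = np.sqrt(np.mean((estimates - true_total) ** 2))
    return bias, std_bias, rmse
```

A bias is then judged statistically significant when the ratio `bias / std_bias` is large in absolute value, which is the conventional t-statistic.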

The main conclusions from Tables 1.1 — 1.3 are as follows: 


1- All the predictors considered for this study are virtually 
design unbiased with all three sample sizes, irrespective 
of the underlying working model. The model dependent
predictor defined by (4.4) has a
statistically significant bias when tested by use of the
conventional t-statistic but the actual bias is negligible
when compared to the true population total. (This
predictor is the only predictor considered in this
study that is not design consistent).


The next three comments refer to the RMSE of the 
various predictors. 


2- The predictors in Groups 2 and 3 that use the auxiliary
values perform much better than the predictors in Group
1, particularly for the smaller sample sizes. The predictors
in Group 2 that employ the 3rd order polynomial
regression model (7.1) perform better than the corresponding
predictors in Group 3 that employ the simple
regression model as the working model, but the differences
diminish as the sample size increases.


3- An important result emerging from this study is that the
regression predictor defined by (4.8) and the predictor defined
by (5.2) (and also the bias corrected predictor defined by (4.12)
for the larger sample sizes), that only predict the y-values for units
outside the sample, indeed perform better than the other
predictors in their respective groups (see also below). As
surmised in Remark 7, this holds particularly with the
larger sample sizes. Notice that the differences between
the regression predictor (4.8) and the GREG estimator for
n = 1,145 and n = 2,250 are smaller under the polynomial model
(Group 2) than under the simple regression model
(Group 3), which is explained by the tight relationship
between the study variable and auxiliary variables under
the polynomial model. The bias corrected predictor (4.12) is
less stable than the regression predictor (4.8) for n = 232 but
for the other two sample sizes the two predictors perform similarly.


4- The regression predictor defined by (4.8) performs somewhat
better than the model dependent predictor defined by (4.4), which
employs the expectations $E_s(w_i | x_i)$ to adjust the sampling weights.
We have no clear explanation for this result because, as
illustrated in Pfeffermann and Sverchkov (1999) using
the same data, adjusting the sampling weights improves
the estimation of the regression coefficients very
significantly.


Next consider the MSE estimators. 


5- The MSE estimators developed in section 6 perform 
very well for all the predictors and with all the sample 
sizes. For the sample size n = 232 there is a systematic 
under-estimation of the RMSE by up to 3%, which is 
explained by the fact that the pseudo population is in this 
case less variable than the actual study population (see 
Remark 9). The MSE estimators are almost unbiased for
the other sample sizes, with the largest difference
between the estimated and true RMSE being again of the
order of 3%.


Another way of assessing the bias of the various
predictors and their MSE estimation is by studying the
coverage properties of confidence intervals defined by these
predictors. Tables 2.1 to 2.3 compare the empirical
percentage coverage of the standard confidence intervals
$\hat{Y} + z_{\alpha}\sqrt{\widehat{\mathrm{MSE}}}$ with the corresponding nominal
percentages for selected values of $\alpha$ (one table for each
sample size). The empirical percentages are somewhat
erratic with n = 232 sample units but they stabilize as the
sample size increases, particularly with the use of the
predictors in the second and third group. The empirical
percentages are close to the nominal percentages with all the
predictors when n = 2,250.
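
An empirical coverage percentage of this kind can be computed as sketched below. The sketch rests on an assumption about how the intervals are read: since the nominal percentages range from 1.0 to 99.0, each interval is taken to be one-sided, bounded above by the predictor plus the standard normal quantile times the square root of the estimated MSE. The function name `empirical_coverage` is hypothetical.

```python
from statistics import NormalDist

import numpy as np


def empirical_coverage(predictions, mse_estimates, true_total, nominal_pct):
    """Percentage of samples whose upper bound covers the true total.

    Assumed reading: the bound is prediction + z * sqrt(estimated MSE),
    with z the standard normal quantile at nominal_pct / 100.
    """
    z = NormalDist().inv_cdf(nominal_pct / 100.0)
    predictions = np.asarray(predictions, dtype=float)
    mse_estimates = np.asarray(mse_estimates, dtype=float)
    bounds = predictions + z * np.sqrt(mse_estimates)
    return 100.0 * float(np.mean(true_total <= bounds))
```

With unbiased predictors, accurate MSE estimates and approximately normal prediction errors, the returned percentage should be close to `nominal_pct`.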


Table 1.1
Bias, RMSE and Square Root of Mean of MSE Estimators, n = 232

Group                   Predictor         Bias (Std)       RMSE     Root Mean MSE Est.
1                       HT                -4.5 (11.6)      365.1    355.0
No x-values             Eq. (5.2)          1.5 (2.9)        91.1     89.8
                        Hajek              1.7 (2.9)        93.0     91.6
2                       Eq. (4.4)          4.4 (2.0)        64.0     63.0
3rd order polynomial    Reg, Eq. (4.8)     3.5 (2.0)        63.4     62.4
regression              Eq. (4.12)        -0.3 (2.1)        65.4     65.0
                        GREG               3.4 (2.1)        63.6     62.6
3                       Reg, Eq. (4.8)     [illegible]      68.0     66.2
Simple regression       Eq. (4.12)        -0.3 (2.2)        68.6     67.4
                        GREG               [illegible]      68.3     66.5

True ‘population’ total = 2710.7

Table 1.2
Bias, RMSE and Square Root of Mean of MSE Estimators, n = 1,145

Group                   Predictor         Bias (Std)           RMSE     Root Mean MSE Est.
1                       HT                -9.1 (5.0)           157.1    156.1
No x-values             Eq. (5.2)          0.0 (1.1)            35.2     34.9
                        Hajek              [illegible] (1.3)    39.5     39.3
2                       Eq. (4.4)          3.0 (0.9)            27.6     [illegible]
3rd order polynomial    Reg, Eq. (4.8)     2.0 (0.9)            27.4     27.3
regression              Eq. (4.12)         0.5 (0.9)            27.4     [illegible]
                        GREG               1.7 (0.9)            27.8     [illegible]
3                       Reg, Eq. (4.8)     0.0 (1.0)            28.3     28.7
Simple regression       Eq. (4.12)         0.1 (1.0)            28.2     28.9
                        GREG               0.0 (2.0)            29.1     29.6

True ‘population’ total = 2710.7


Table 1.3
Bias, RMSE and Square Root of Mean of MSE Estimators, n = 2,250

Group                   Predictor         Bias (Std)       RMSE           Root Mean MSE Est.
1                       HT                 1.3 (2.7)       82.7           80.4
No x-values             Eq. (5.2)         -0.2 (0.6)       18.5           18.8
                        Hajek              0.1 (0.7)       23.5           23.8
2                       Eq. (4.4)          1.3 (0.5)       17.5           17.3
3rd order polynomial    Reg, Eq. (4.8)     0.6 (0.5)       16.9           16.3
regression              Eq. (4.12)        -0.3 (0.5)       17.1           16.5
                        GREG               0.5 (0.5)       17.9           18.3
3                       Reg, Eq. (4.8)    -0.3 (0.5)       17.3           16.8
Simple regression       Eq. (4.12)        -0.3 (0.5)       [illegible]    [illegible]
                        GREG              -0.2 (0.6)       18.8           18.3

True ‘population’ total = 2710.7


Table 2.1
Nominal and Empirical Percentage Coverage of Confidence Intervals, n = 232

Group                   Predictor         1.0    2.5            5.0            10.0   90.0   95.0   97.5   99.0
1                       HT                2.5    [illegible]    [illegible]    10.0   90.0   97.0   99.0   99.5
No x-values             Eq. (5.2)         0.5    2.0            4.0             8.0   88.5   91.5   95.5   98.0
                        Hajek             0.5    2.0            4.0             8.0   88.5   91.5   95.5   98.0
2                       Eq. (4.4)         0.0    0.0            1.5             6.5   86.0   90.5   92.5   97.5
3rd order polynomial    Reg, Eq. (4.8)    0.0    0.0            2.0             7.0   85.0   90.5   93.5   98.0
regression              Eq. (4.12)        0.0    0.5            [illegible]     6.5   87.5   91.0   95.0   98.5
                        GREG              0.0    0.0            2.0             7.0   85.0   90.5   93.5   98.0
3                       Reg, Eq. (4.8)    0.0    1.0            2.5             7.0   87.0   91.5   97.5   98.0
Simple regression       Eq. (4.12)        0.0    1.0            2.5             7.0   86.0   91.5   96.5   98.0
                        GREG              0.0    1.0            2.5             7.0   86.5   91.5   97.0   98.0
Table 2.2
Nominal and Empirical Percentage Coverage of Confidence Intervals, n = 1,145

Group                   Predictor         1.0    2.5            5.0    10.0           90.0   95.0   97.5   99.0
1                       HT                4.0    7.0            9.0    13.5           95.5   98.0   98.5   99.5
No x-values             Eq. (5.2)         3.0    5.0            8.0    12.5           92.5   95.5   99.5   100.0
                        Hajek             3.5    5.0            9.5    12.5           92.5   96.0   99.5   100.0
2                       Eq. (4.4)         0.5    2.0            5.0    [illegible]    86.5   93.5   96.0   97.0
3rd order polynomial    Reg, Eq. (4.8)    0.5    3.0            6.0     9.0           86.5   94.5   96.5   97.0
regression              Eq. (4.12)        0.5    2.0            6.0     9.5           88.0   94.0   97.0   98.0
                        GREG              0.5    3.0            5.0     9.0           86.5   94.0   96.5   98.0
3                       Reg, Eq. (4.8)    0.5    3.0            6.0    11.0           90.0   93.0   97.0   99.5
Simple regression       Eq. (4.12)        0.5    [illegible]    5.5    10.5           90.0   94.0   97.0   99.5
                        GREG              1.0    3.0            6.0    11.0           90.5   94.0   97.5   99.0

Table 2.3
Nominal and Empirical Percentage Coverage of Confidence Intervals, n = 2,250

Group                   Predictor         1.0    2.5    5.0            10.0           90.0   95.0   97.5   99.0
1                       HT                0.5    1.0    [illegible]    11.0           95.0   97.5   99.0   99.5
No x-values             Eq. (5.2)         1.0    3.0    5.5             9.0           91.5   96.0   99.0   99.5
                        Hajek             1.0    2.5    [illegible]     9.0           93.0   97.0   98.5   99.5
2                       Eq. (4.4)         0.5    2.0    5.0             9.0           91.0   94.5   96.5   97.5
3rd order polynomial    Reg, Eq. (4.8)    0.5    2.5    6.5            10.5           90.5   94.5   96.5   98.0
regression              Eq. (4.12)        0.5    2.0    7.5            12.5           91.5   95.5   96.5   97.5
                        GREG              0.5    2.0    6.0            [illegible]    91.0   94.5   96.0   98.0
3                       Reg, Eq. (4.8)    1.0    3.0    6.0            11.0           91.0   95.0   97.5   99.0
Simple regression       Eq. (4.12)        1.0    2.0    6.0            12.0           90.0   95.0   97.5   98.0
                        GREG              0.0    1.5    5.0            11.5           91.5   95.0   97.5   99.0


As implied by the theoretical developments of this article 
and illustrated in the empirical study, predicting only the y- 
values for units outside the sample employing the sample- 
complement model yields better predictors for the 
population total than predicting all the population values by 
use of the population model, as implicitly implemented 
when using the GREG or Hajek’s estimators. Clearly, the 
differences are only appreciable when the sampling 
fractions are not negligible. 

In order to highlight this point further, we present in
Table 3 the mean prediction error (mpe) in the original scale
(grams) over the 1,000 samples when predicting the sample-
complement values;


$\mathrm{mpe} = \sum_{r=1}^{1{,}000} \Big[ \sum_{j \notin s_r} (\hat{y}_j - y_j)/(N - n) \Big] / 1{,}000$,


where $s_r$ defines the $r$-th selected sample. The mpe’s are
shown for three predictors, all utilizing the working model
(7.1) and thus having the general form, $\hat{y}_j = \hat{\beta}_0 +
\hat{\beta}_1 x_j + \hat{\beta}_2 x_j^2 + \hat{\beta}_3 x_j^3$, $j \notin s$. For the first predictor the
vector $\beta = (\beta_0, \beta_1, \beta_2, \beta_3)$ is estimated by OLS, which
corresponds to the use of the sample model; for the second
predictor $\beta$ is estimated by the probability weighted
estimator $\hat{\beta}_{pw}$, which corresponds to the use of the population
model; whereas for the third predictor $\beta$ is estimated by an
estimator that is computed similarly to $\hat{\beta}_{pw}$ but with
weights $(w_i - 1)$, which corresponds to the use of the sample-
complement model.
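
The three estimators of $\beta$ differ only in the weights attached to the sample units, which the following sketch makes explicit. It is an illustration under assumptions rather than the code used in the study: the function name `weighted_cubic_fit` is hypothetical, and each estimator is computed as a weighted least squares fit of the cubic working model (7.1).

```python
import numpy as np


def weighted_cubic_fit(x, y, weights):
    """Weighted least squares fit of y on (1, x, x^2, x^3).

    weights = 1        -> OLS (sample model)
    weights = w_i      -> probability weighted (population model)
    weights = w_i - 1  -> sample-complement model
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.vander(x, N=4, increasing=True)          # columns 1, x, x^2, x^3
    root_w = np.sqrt(np.asarray(weights, dtype=float))
    # Scaling rows by sqrt(weight) turns WLS into an ordinary least squares problem.
    beta, *_ = np.linalg.lstsq(X * root_w[:, None], y * root_w, rcond=None)
    return beta
```

For example, with sampling weights `w`, the three fits are `weighted_cubic_fit(x, y, np.ones_like(w))`, `weighted_cubic_fit(x, y, w)` and `weighted_cubic_fit(x, y, w - 1)`. With an auxiliary variable measured in weeks, centering `x` before fitting may help the numerical conditioning of the cubic terms.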
Table 3
Mean Prediction Errors and Std of Means (in brackets) Under Three Prediction Models

Sample size    Sample Model
232            329.0 (2.2)
1,145          375.0 (0.9)
2,250          387.5 (0.6)


The clear conclusion emerging from Table 3 is that the 
use of either the population model or the model holding for 
units in the sample for the prediction of y-values of units 
outside the sample can result in appreciable biases. Notice 
that the bias induced by use of the population model 
increases as the sampling fraction increases, which agrees 
with the previous discussion asserting that the difference 
between the sample and sample-complement models only 
shows up with relatively large sample sizes (see Comment
2).


8. CONCLUDING REMARKS 


In this article we use the sample and sample-complement 
distributions for developing design consistent predictors of 
finite population totals. Known predictors in common use 
are shown to be special cases of the present theory. The 
MSEs of the new predictors are estimated by a combination 
of an inverse sampling algorithm and a resampling method. 
As supported by theory and illustrated in the empirical 
study, predictors of finite population totals that only require 
the prediction of the outcome values for units outside the 
sample perform better than predictors in common use even 
under a design based framework, unless the sampling 
fractions are very small. The MSE estimators are shown to 
perform well both in terms of bias and when used for the 
computation of confidence intervals for the population 
totals. Further experimentation with these kinds of predictors
and MSE estimators is therefore highly recommended.
Weighted Estimation in Multilevel Ordinal and Binary Models
in the Presence of Informative Sampling Designs


LEONARDO GRILLI and MONICA PRATESI ¹


ABSTRACT 


Multilevel models are often fitted to survey data gathered with a complex multistage sampling design. However, if such a 
design is informative, in the sense that the inclusion probabilities depend on the response variable even after conditioning 
on the covariates, then standard maximum likelihood estimators are biased. In this paper, following the Pseudo Maximum 
Likelihood (PML) approach of Skinner (1989), we propose a probability-weighted estimation procedure for multilevel 
ordinal and binary models which eliminates the bias generated by the informativeness of the design. The reciprocals of the 
inclusion probabilities at each sampling stage are used to weight the log-likelihood function and the weighted estimators 
obtained in this way are tested by means of a simulation study for the simple case of a binary random intercept model with 
and without covariates. The variance estimators are obtained by a bootstrap procedure. The maximization of the weighted 
log-likelihood of the model is done by the NLMIXED procedure of the SAS, which is based on adaptive Gaussian 
quadrature. Also the bootstrap estimation of variances is implemented in the SAS environment. 


KEY WORDS: Informative design; Multilevel ordinal model; Multistage sampling; Pseudo Maximum Likelihood;
Weighting.


1. INTRODUCTION 


Multilevel models for ordinal responses, including 
binary responses as a special case, are frequently used in 
many areas of research for modelling hierarchically 
clustered populations. In fact, both in human and biological 
sciences, the status or the response of a subject may often 
be classified in two categories or in a set of ordered 
categories (ordinal or graded scale). At the same time, 
subjects are observed clustered in groups (e.g., schools, 
firms, clinics, geographical areas). The hierarchical popu- 
lation structure is often also employed to design multistage 
sampling schemes, with unequal selection probabilities at 
some or all the stages of the sampling process. In the 
multilevel analysis of survey data, complex sampling 
schemes are often ignored even if they may cause the 
violation of the basic assumptions underlying multilevel 
models. In fact, in complex sampling designs both the 
subjects and the clusters at all levels could be selected with 
probabilities that, even conditionally on the covariates, do 
depend on the response variable; in other words, the 
sampling design might be informative. 

For data that are clustered and obtained by multistage 
informative designs, proposals for fitting multilevel models 
have been formulated mainly for the case of continuous 
response variables. In particular, Pfeffermann, Skinner, 
Holmes, Goldstein and Rasbash (1998) propose proba- 
bility-weighting procedures of first and second level units 
that adjust for the effect of an informative design on the
estimation in two-level models with a continuous response
variable. The method, known as Pseudo Maximum
Likelihood (PML), consists in writing down a closed form
expression for the census likelihood, estimating the
log-likelihood function and then maximizing the estimated
function numerically. The method needs the sampling
weights for the sampled elements and clusters at all levels.
The authors also develop appropriate ‘sandwich’ estimators
for the variances of the estimators.

The work of Pfeffermann et al. (1998) is mainly 
concerned with the implementation of the PML principle in 
the IGLS (Iterative Generalised Least Squares) algorithm 
(Goldstein 1986), which is suitable for linear multilevel 
models. The probability-weighted IGLS algorithm is 
available in the widespread package MLwiN (Rasbash, 
Browne, Goldstein, Yang, Plewis, Healy, Woodhouse and 
Draper 1999). However, the extension to nonlinear models 
is not trivial. For the nonlinear case the developers of 
MLwiN implemented a weighting procedure that parallels 
the one used for linear models with some ad hoc solution 
for the level 1 variation: for example, for binary responses 
the subject-level weights are included in the binomial 
denominator. The proposed method is straightforward to 
implement, but its properties have not been investigated yet. 
Moreover, Renard and Molenberghs (2002) report the case
of an application where the aforementioned algorithm for
weighting in multilevel binary models did not converge or
yielded implausible results.


¹ Leonardo Grilli, Dipartimento di Statistica, Università di Firenze. E-mail: grilli@ds.unifi.it; Monica Pratesi, Dipartimento di Statistica e Matematica applicata
The simulation study which we will use to judge the
performance of the PML estimators will closely follow the
lines of Pfeffermann et al. (1998), since they use a similar
approach for the linear model, so that some interesting
comparisons are possible. However, when making the
comparisons it should always be kept in mind that, while in the
two-level linear model the two variance components can be
estimated separately, in the two-level binary model only a
ratio of the two variance components is estimable, as
discussed further on.

A recent paper which deals with the estimation of
variance components is Korn and Graubard (2003), whose
work is motivated by the substantial bias shown in small
samples by several weighted estimators of variance components
proposed to adjust for informative designs (Graubard
and Korn 1996). Though the topic is the same, the work of
Korn and Graubard differs from ours in many respects:
a) Like Pfeffermann et al. (1998), they consider only the
linear multilevel model. b) In the context of the linear
multilevel model, they focus on unbiased estimation of the
variance components in small samples: in fact they propose
some estimators for the variance components and only
sketch how to derive similar estimators for the linear model
with covariates, but without testing their performance.
In any case, the extension to nonlinear multilevel models is not
trivial. c) The main estimators proposed by Korn and
Graubard (2003), which are in closed form, showed good
performance even in small samples. However, they rely on
the pairwise joint inclusion probabilities. When such probabilities
are not available, which is often the case in practice,
the authors propose a variant whose bias is substantial when
the number of sampled clusters is moderate (33 in their
simulation plan). In contrast, the PML method adopted in
our work does not require joint inclusion probabilities. d) The
informative design used by Korn and Graubard (2003) for
their simulation study is quite different from ours: in fact, in
their design the undersampling of the units depends on
whether the model’s random errors are greater than a
certain threshold in absolute value, while in our design the
criterion depends on whether the random errors are high or
low. Therefore a comparison of the results is difficult.

The wide use of nonlinear multilevel models in many
fields of application calls for a general and reliable
weighted estimation method, which should be both effective
and simple to implement, preferably in the framework
of standard statistical software. The present paper
represents a contribution in this direction.

It is worth noting that the PML method we exploit is
quite general, so it can be applied to a wide range of
models. In the paper the focus is on models for ordinal and
binary responses, since they are very common and can be
represented as a linear model for the latent response
endowed with a set of thresholds (see section 2), facilitating
the comparison with the existing results for the linear
model. However, the description of the PML approach is
fully general and the estimation technique based on
the NLMIXED procedure of SAS (reported in Appendix A)
is easy to generalize.

The structure of the paper is as follows. Basic definitions 
for the multilevel ordinal model are set out in section 2, 
while in section 3 the general PML approach is described, 
along with some details for fitting the model using SAS 
NLMIXED. In section 4 the properties of the various 
estimators for the random intercept binary model are 
evaluated by a simulation study. Section 5 concludes with 
some final remarks. 


2. THE MULTILEVEL ORDINAL MODEL 


In order to ease the comparison with the results 
concerning the linear model (Pfeffermann et al. 1998; Korn 
and Graubard 2003), it is useful to write the ordinal model 
in terms of a latent linear model endowed with a set of 
thresholds. Suppose that an observed ordinal response
variable $Y$, with $k = 1, 2, \ldots, K$ levels, is generated, through
a set of thresholds, by a latent continuous variable $\tilde{Y}$
following a variance component model (Hedeker and
Gibbons 1994):

$\tilde{Y}_{ij} = \beta' x_{ij} + \omega u_j + \varepsilon_{ij}$   (1)

with $i = 1, 2, \ldots, N_j$ elementary units (subjects) for the $j$-th
cluster ($j = 1, 2, \ldots, M$); $x_{ij}$ is a covariate vector and $\beta$
is the corresponding vector of slopes; the random variables $\varepsilon_{ij}$
and $u_j$ are the disturbances, respectively at the first
(subject) and second (cluster) level; and $\omega^2$ is the second
level variance component.

For the disturbances of model (1) we make the standard
assumptions, i.e., a) the $\varepsilon_{ij}$'s are iid with zero mean and
unknown variance $\sigma^2$; b) the $u_j$'s are Gaussian iid with
zero mean and unit variance; c) the $\varepsilon_{ij}$'s and $u_j$'s are
mutually independent.

Note that model (1) leads to the simplest case of a multi- 
level ordinal model, with just two levels and a single 
random effect on the intercept; the extension to three or 
more levels and to multiple random effects is straightforward
in principle (Gibbons and Hedeker 1997), but the
complications in the formulae suggest considering only the
simplest case, which is sufficient for the discussion of the
main conceptual issues.

The observed ordinal variable $Y$ is linked to the latent
one $\tilde{Y}$ through the following relationship:


$\{Y_{ij} = k\} \Leftrightarrow \{\gamma_{k-1} < \tilde{Y}_{ij} \le \gamma_k\}$,

where the thresholds satisfy $-\infty = \gamma_0 \le \gamma_1 \le \ldots \le \gamma_{K-1} \le
\gamma_K = +\infty$. Therefore, conditional on $u_j$, the model probability
for subject $i$ of cluster $j$ is


$P(Y_{ij} = k | u_j) = P(\gamma_{k-1} < \tilde{Y}_{ij} \le \gamma_k | u_j)$
$= P(\tilde{Y}_{ij} \le \gamma_k | u_j) - P(\tilde{Y}_{ij} \le \gamma_{k-1} | u_j)$,   (2)


with


$P(\tilde{Y}_{ij} \le \gamma_k | u_j) = P(\varepsilon_{ij} \le \gamma_k - [\beta' x_{ij} + \omega u_j] | u_j)$
$= P\big(\varepsilon_{ij}/\sigma \le (\gamma_k - [\beta' x_{ij} + \omega u_j])/\sigma \,|\, u_j\big)$
$= F(\gamma_{\sigma,k} - [\beta_\sigma' x_{ij} + \omega_\sigma u_j])$,   (3)


where $F(\cdot)$ is the distribution function of the standardized
first level error term $\varepsilon_{ij}/\sigma$. All the model parameters are
defined in terms of the unknown $\sigma$, the standard deviation
of the first level error term, so only the ratios of the model
parameters to the standard deviation of the first level error
term are identifiable; we use the notation $\psi_\sigma$ to indicate that
the latent model parameter $\psi$ is in $\sigma$ units, i.e., $\psi_\sigma = \psi/\sigma$.
Note that $F(\cdot)$ is also the inverse of the link function of the
ordinal model: for example, the standard Gaussian
distribution function yields the ordinal probit model.

As for identification, if $\beta_\sigma$ includes the intercept, the
estimable thresholds are $K - 2$; so it is customary to set
$\gamma_{\sigma,1} = 0$. Alternatively, if the intercept is fixed to zero all
the $K - 1$ thresholds are estimable.
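
The conditional category probabilities defined by (2) and (3) can be computed as in the sketch below, which assumes the ordinal probit case ($F$ equal to the standard Gaussian distribution function). The function name `category_probabilities` and its arguments are hypothetical: `thresholds` holds the $K - 1$ finite thresholds in $\sigma$ units and `eta` stands for the linear predictor $\beta_\sigma' x_{ij} + \omega_\sigma u_j$.

```python
from statistics import NormalDist


def category_probabilities(thresholds, eta):
    """P(Y = k | u) for k = 1, ..., K under an ordinal probit model.

    thresholds: increasing list of the K - 1 finite thresholds (sigma units).
    eta: linear predictor, including the random effect term.
    """
    cdf = NormalDist().cdf
    # Cumulative probabilities at gamma_0 = -inf, the finite thresholds, gamma_K = +inf.
    cumulative = [0.0] + [cdf(g - eta) for g in thresholds] + [1.0]
    # Differences of consecutive cumulative probabilities, as in (2).
    return [cumulative[k] - cumulative[k - 1] for k in range(1, len(cumulative))]
```

For instance, with $K = 3$, the first threshold set to zero and a second threshold of one, `category_probabilities([0.0, 1.0], 0.0)` gives the three category probabilities for a subject whose linear predictor is zero.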

Now let $\theta$ denote the vector of all estimable parameters,
which include $\beta_\sigma$, $\omega_\sigma$ and $K - 2$ thresholds
$\{\gamma_{\sigma,k}: k = 2, \ldots, K-1\}$ ($\gamma_{\sigma,1}$ is fixed to zero to ensure
identifiability). The conditional likelihood for subject $i$ of
cluster $j$ is


$L_{ij}(\theta | u_j) = \prod_{k=1}^{K} [P(Y_{ij} = k | u_j)]^{d_{ijk}}$,   (4)


where $P(Y_{ij} = k | u_j)$ is defined by (2) and (3), while $d_{ijk}$ is
the indicator function of the event $\{Y_{ij} = k\}$. Then the
marginal likelihood for cluster $j$ is


$L_j(\theta) = \int_{-\infty}^{+\infty} \prod_{i=1}^{N_j} L_{ij}(\theta | u) \varphi(u)\,du$,


where $\varphi$ is the standard Gaussian density function. Finally,
the overall marginal likelihood is


$L(\theta) = \prod_{j=1}^{M} L_j(\theta)$.   (5)




3. PROBABILITY-WEIGHTED ESTIMATION 
3.1 Pseudo Maximum Likelihood (PML) Estimators 


Suppose that the whole population of $M$ clusters (level 2
units) with $N_j$ elementary units (subjects or level 1 units)
per cluster is not observed; instead the following two-stage
sampling scheme is used:


- first stage: $m$ clusters are selected with inclusion
probabilities $\pi_j$ ($j = 1, \ldots, M$);


- second stage: $n_j$ elementary units are selected within
the $j$-th selected cluster with probabilities
$\pi_{i|j}$ ($i = 1, \ldots, N_j$).


The unconditional sample inclusion probabilities are
then $\pi_{ij} = \pi_{i|j} \pi_j$.

When the sampling mechanism is informative, i.e., the $\pi_j$
and/or the $\pi_{i|j}$ depend on the model disturbances and hence
on the response variable, the maximum likelihood estimator
of the parameters of the multilevel ordinal model defined in
section 2 may be seriously biased.

A standard solution to this problem is provided by the 
Pseudo Maximum Likelihood (PML) approach (Skinner 
1989). However in the context of multilevel models the 
implementation of the PML approach is complicated by the 
fact that the population log-likelihood is not a simple sum 
of elementary unit contributions, but rather a function of 
sums across level 2 and level 1 units. This can be seen by
writing the logarithm of the likelihood (5) as follows: 


$$\log L(\theta) = \sum_{j=1}^{M} \log \int_{-\infty}^{+\infty} \exp\left\{ \sum_{i=1}^{N_j} \log L_{ij}(\theta \mid u) \right\} \phi(u)\, du. \tag{6}$$


A design consistent estimate of the population log-likelihood (6)
can be obtained by applying the Horvitz-Thompson principle, i.e.,
replacing each sum over the level 2 population units j by a sample
sum weighted by $w_j = 1/\pi_j$ and each sum over the level 1 units i
by a sample sum weighted by $w_{i|j} = 1/\pi_{i|j}$:


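$$\log \hat{L}(\theta) = \sum_{j}^{s} w_j \log \int_{-\infty}^{+\infty} \exp\left\{ \sum_{i}^{s} w_{i|j} \log L_{ij}(\theta \mid u) \right\} \phi(u)\, du, \tag{7}$$

where $\sum^{s}$ denotes a sum over sample units.

Note that inserting the weights in the log-likelihood
implies the use of a design consistent estimator of the
population score function. In fact, the population score
function $U(\theta) = \partial/\partial\theta\, \log L(\theta)$ can be written as

$$U(\theta) = \sum_{j=1}^{M} \frac{\displaystyle\int_{-\infty}^{+\infty} \exp\left\{ \sum_{i=1}^{N_j} \log L_{ij} \right\} \left( \sum_{i=1}^{N_j} \frac{\partial}{\partial\theta} \log L_{ij} \right) \phi(u)\, du}{\displaystyle\int_{-\infty}^{+\infty} \exp\left\{ \sum_{i=1}^{N_j} \log L_{ij} \right\} \phi(u)\, du}, \tag{8}$$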
se ei 


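where $L_{ij} = L_{ij}(\theta \mid u)$, whose corresponding Horvitz-
Thompson estimator $\hat{U}(\theta)$ is

$$\hat{U}(\theta) = \sum_{j}^{s} w_j \frac{\displaystyle\int_{-\infty}^{+\infty} \exp\left\{ \sum_{i}^{s} w_{i|j} \log L_{ij} \right\} \left( \sum_{i}^{s} w_{i|j} \frac{\partial}{\partial\theta} \log L_{ij} \right) \phi(u)\, du}{\displaystyle\int_{-\infty}^{+\infty} \exp\left\{ \sum_{i}^{s} w_{i|j} \log L_{ij} \right\} \phi(u)\, du}, \tag{9}$$

which equals the score obtained by differentiating the
probability-weighted log-likelihood (7).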
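To make the structure of the probability-weighted log-likelihood (7) concrete, the sketch below evaluates it for a toy sample. It assumes the binary probit form $P(Y_{ij} = 1 \mid u) = F(\beta_\sigma + \omega_\sigma u)$ and replaces Gaussian quadrature with a simple midpoint rule; the function names and the toy data are hypothetical and serve only as an illustration.

```python
import math

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def log_cond_lik(y, beta_s, omega_s, u):
    """log L_ij(theta | u) for a binary response (assumed probit form)."""
    p1 = norm_cdf(beta_s + omega_s * u)
    p = p1 if y == 1 else 1.0 - p1
    return math.log(max(p, 1e-300))  # floor guards against log(0)

def weighted_log_lik(sample, beta_s, omega_s, n_grid=400, bound=8.0):
    """Expression (7). `sample` is a list of (w_j, [(w_i|j, y_ij), ...])."""
    h = 2.0 * bound / n_grid
    total = 0.0
    for w_j, units in sample:
        integral = 0.0
        for g in range(n_grid):
            u = -bound + (g + 0.5) * h
            # level 1 weights multiply the log contributions inside exp{.}
            expo = sum(w * log_cond_lik(y, beta_s, omega_s, u) for w, y in units)
            integral += math.exp(expo) * norm_pdf(u) * h
        # level 2 weights multiply the log of each cluster integral
        total += w_j * math.log(integral)
    return total

toy_sample = [(2.0, [(1.0, 1), (1.0, 0)]), (3.0, [(1.5, 1)])]
base = weighted_log_lik(toy_sample, 0.1, 0.632)
inflated = weighted_log_lik([(10.0 * w, units) for w, units in toy_sample], 0.1, 0.632)
```

Here `inflated` is ten times `base`: multiplying all level 2 weights by a constant rescales the objective without moving its maximizer, whereas the level 1 weights enter nonlinearly inside the exponent.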

Under mild conditions, the solution $\hat{\theta}_{PML}$ to the estimating
equations $\hat{U}(\theta) = 0$ is design consistent for the
finite population maximum likelihood estimator $\hat{\theta}$ which,
in turn, is model-consistent for the super-population parameter
$\theta$: therefore $\hat{\theta}_{PML}$ is a consistent estimator of $\theta$ with
respect to the mixed design-model distribution
(Pfeffermann 1993).

Note that general probability-weighted estimators for 
nonlinear multilevel models can also be devised by 
weighting suitable estimating functions, as in the work of 
Singh, Folsom and Vaish (2002) in the context of small area 
estimation. 

The implementation of the PML approach requires the 
knowledge of the inclusion probabilities at both levels. 
Using only second level weights or only first level weights 
may be insufficient or may even worsen the situation, as 
shown by our simulations. 


3.2 Scaling the Weights 


A controversial issue discussed in Pfeffermann et al. 
(1998) and Korn and Graubard (2003) is the scaling of the 
weights to obtain estimators with little bias even in small 
samples. Obviously, scaling is not relevant for the level 2 
weights, since from (7) and (9) it is clear that multiplying 
the $w_j$'s by a constant does not change the PML estimates
(it simply inflates the information matrix by that constant). 
On the contrary, scaling the level 1 weights may have 
important effects on the small sample behavior of the PML 
estimator. In the simulation study discussed in section 4 we 
present the results for the following type of scaling (named 
‘scaling method 2’ in Pfeffermann et al. 1998): 

$$w^{*}_{i|j} = \frac{w_{i|j}}{\bar{w}_j} \tag{10}$$


where $\bar{w}_j = \left(\sum_{i}^{s} w_{i|j}\right)/n_j$, so that, for the j-th cluster, the sum
of the scaled weights equals the cluster sample size $n_j$. In
the present paper we do not wish to discuss the relative 
merits of the various scaling methods, so we limit our 
simulations to scaled weights (10), which have an intuitive 
meaning and showed good performance in the study of 
Pfeffermann et al. (1998), although they may yield a 
substantial bias with certain designs, as discussed in Korn 
and Graubard (2003). The topic will be broached again in 
section 4. 
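The scaling in (10) is a one-line computation per cluster. A minimal sketch follows; the function name and the example weights are hypothetical.

```python
def scale_weights(level1_weights):
    """Scaled weights (10): divide each w_i|j by the cluster mean weight."""
    n_j = len(level1_weights)
    mean_w = sum(level1_weights) / n_j
    return [w / mean_w for w in level1_weights]

# Hypothetical cluster with N_j = 80 split into two strata of 40 units:
# 2 units sampled from the first stratum (weight 40/2 = 20) and
# 6 units from the second (weight 40/6).
raw = [20.0, 20.0] + [40.0 / 6.0] * 6
scaled = scale_weights(raw)

# With equal raw weights (simple random sampling within the cluster)
# every scaled weight equals 1.
equal = scale_weights([8.0, 8.0, 8.0, 8.0])
```

In the example the scaled weights sum to the cluster sample size of 8, and under equal raw weights the scaled weights are all 1, so the level 1 weighting has no effect.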


3.3 Estimation Technique 


The maximization of the weighted log-likelihood (7) 
involves the computation of several integrals which do not 
have a closed-form solution, so a numerical approximation 
technique is required. When the dimensionality of the 
integrals is low, a simple and very accurate technique is 
Gaussian quadrature, which is based on a summation over 
an appropriate set of points. The NLMIXED procedure of
SAS (SAS Institute 1999) is a general procedure for fitting 
nonlinear random effects models using adaptive Gaussian 
quadrature. Various optimization techniques are available 
to carry out the maximization; the default, used in the 
simulations of section 4, is a dual quasi-Newton algorithm, 
where dual means that the updating concerns the Cholesky
factor of an approximate Hessian (SAS Institute 1999). 
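As an illustration of the idea of summing over an appropriate set of points, the sketch below uses the three-point Gauss-Hermite rule for a standard Gaussian weight (nodes $0, \pm\sqrt{3}$ with weights $2/3, 1/6, 1/6$). This is a non-adaptive toy version and not the adaptive 10-point rule of NLMIXED; the integrand and the function names are assumptions chosen because a closed form is available for comparison.

```python
import math

# Three-point Gauss-Hermite rule with respect to the standard Gaussian density.
NODES = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))
WEIGHTS = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)

def gauss_hermite_expectation(f):
    """Approximate E[f(Z)], Z ~ N(0, 1), by a weighted sum over the nodes."""
    return sum(w * f(z) for z, w in zip(NODES, WEIGHTS))

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

beta_s, omega_s = 0.3, 0.632  # illustrative values

# Marginal probability of a single positive response in a probit
# random intercept model: E[F(beta + omega * Z)].
approx = gauss_hermite_expectation(lambda z: norm_cdf(beta_s + omega_s * z))

# Known closed form for this particular integrand.
exact = norm_cdf(beta_s / math.sqrt(1.0 + omega_s ** 2))

second_moment = gauss_hermite_expectation(lambda z: z ** 2)
fourth_moment = gauss_hermite_expectation(lambda z: z ** 4)
```

The rule reproduces the Gaussian moments up to order five exactly, and even with three points the probit integral is close to its closed form; more points and adaptive centering matter when cluster sizes make the integrand sharply peaked.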

Though the NLMIXED procedure does not include an 
option for PML estimation, it is still possible to insert the 
weights in the likelihood, using different tricks for level 1 
and level 2 weights, as explained in Appendix A. 


3.4 Variance Estimation 


In standard maximum likelihood the estimation of the 
covariance matrix of the estimators is obtained by inverting 
the information matrix. However this conventional esti- 
mator is not appropriate when using the PML method since 
it does not take into account the variability stemming from 
the sampling design. To get a more reliable covariance 
matrix Skinner (1989) proposed the use of a robust 
‘sandwich’ estimator, which is employed also by 
Pfeffermann et al. (1998). 

As noted in section 3.3, the NLMIXED procedure of
SAS allows the model to be fitted with the PML approach, but the
estimated covariance matrix, which is obtained by inverting
the information matrix, is likely to give a misleading picture of
the actual variability of PML estimators. In the
SAS framework the derivation of ‘sandwich’ estimators is
not trivial. However, a simple and effective solution,
requiring a bit of programming, is to empirically estimate
the variance through the bootstrap technique for finite
populations (Särndal, Swensson and Wretman 1992), which
consists of the following steps: a) using the sample data, an
artificial finite population is constructed, assumed to mimic
the real population; b) a series of independent bootstrap
samples is drawn from the artificial finite population and
for each bootstrap sample an estimate of the target
parameter is calculated; c) the bootstrap variance estimate
is obtained as the variance of the observed distribution of
the bootstrap estimates.

The artificial finite population can be generated in the
following way: i) for the j-th sampled cluster, each of the $n_j$
sampled elementary units is replicated $w_{i|j}$ times, rounding
the weight to the nearest integer, obtaining an artificial
cluster of about $N_j$ elementary units; ii) each of the $m$
artificial clusters is replicated $w_j$ times, rounding the weight
to the nearest integer, obtaining an artificial population of
about $M$ clusters. Then the samples are selected from the
artificial population in the following way: i) $m$ clusters are
resampled with probability proportional to $\pi_j$; ii) for the
j-th resampled cluster, $n_j$ elementary units are resampled
with probability proportional to $\pi_{i|j}$.

When the sampling fraction $m/M$ is low, most of the
variance is due to the sampling of the clusters, so the
bootstrap procedure described above could be simplified by
omitting the steps concerning the elementary units, i.e., step
i) in the construction of the artificial population and step ii)
in the resampling process.
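The simplified, cluster-only version of the bootstrap can be sketched as follows. This is a rough illustration under stated assumptions: clusters are resampled with replacement using `random.choices` (a simplification of sampling proportional to $\pi_j$), the target estimator is a weighted proportion rather than the PML estimator, and all names and data are hypothetical.

```python
import random
import statistics

def replicate(items, weights):
    """Replicate each item round(w) times (at least once)."""
    out = []
    for item, w in zip(items, weights):
        out.extend([item] * max(1, round(w)))
    return out

def weighted_proportion(clusters, weights):
    """Stand-in target estimator: weighted proportion of positive responses."""
    num = sum(w * sum(ys) for ys, w in zip(clusters, weights))
    den = sum(w * len(ys) for ys, w in zip(clusters, weights))
    return num / den

def bootstrap_variance(clusters, pi, estimator, n_boot=200, seed=2004):
    rng = random.Random(seed)
    m = len(clusters)
    w = [1.0 / p for p in pi]
    # step a) artificial population: each sampled cluster replicated about w_j times
    pop_clusters = replicate(clusters, w)
    pop_pi = replicate(pi, w)
    estimates = []
    for _ in range(n_boot):
        # step b) resample m clusters with probability proportional to pi_j
        # (with replacement: an assumption made to keep the sketch short)
        idx = rng.choices(range(len(pop_clusters)), weights=pop_pi, k=m)
        boot_clusters = [pop_clusters[i] for i in idx]
        boot_weights = [1.0 / pop_pi[i] for i in idx]
        estimates.append(estimator(boot_clusters, boot_weights))
    # step c) variance of the bootstrap estimates
    return statistics.variance(estimates)

demo_clusters = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 0], [0, 0, 0, 1]]
demo_pi = [0.1, 0.2, 0.25, 0.5]
artificial = replicate(demo_clusters, [1.0 / p for p in demo_pi])
var_hat = bootstrap_variance(demo_clusters, demo_pi, weighted_proportion)
```

In an actual application the `estimator` argument would refit the weighted model on each bootstrap sample, which is where the computational cost lies.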

A simpler resampling technique for variance estimation,
considered by Korn and Graubard (2003), is the jackknife.
In the case of clustered designs the technique entails the
calculation of the variance from the set of point estimates
obtained by deleting one cluster at a time, though the
performance of the jackknife with correlated data is not
always satisfactory (Shao and Tu 1995). In our simulation
study the jackknife variance estimator seems unreliable, so
it is not used. Further research is needed to fully evaluate
the potential of the jackknife by testing some suitable
modifications of the technique.
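For reference, the delete-one-cluster jackknife can be sketched as below. The sketch uses the standard unweighted, unstratified form with the factor $(m-1)/m$, which is an assumption and not necessarily the variant examined in the simulations; the function names and data are hypothetical.

```python
def jackknife_variance(cluster_data, estimator):
    """Delete-one-cluster jackknife variance of estimator(cluster_data)."""
    m = len(cluster_data)
    # point estimates obtained by deleting one cluster at a time
    leave_one_out = [estimator(cluster_data[:j] + cluster_data[j + 1:])
                     for j in range(m)]
    centre = sum(leave_one_out) / m
    return (m - 1) / m * sum((e - centre) ** 2 for e in leave_one_out)

def mean(values):
    return sum(values) / len(values)

# Toy check with one summary value per cluster: for the sample mean the
# jackknife variance coincides with s^2 / m.
cluster_values = [1.0, 2.0, 4.0, 7.0]
v_jack = jackknife_variance(cluster_values, mean)
```

For these four values the sample variance is 7, so both the jackknife and the textbook formula give 7/4 = 1.75 for the variance of the mean.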


4. SIMULATION STUDY 


4.1 Design of Experiment 


The experiment reflects the two-stage scheme assumed 
for the observed variables: first, the finite population values 
are generated from the adequate superpopulation model 
(stage I) and then an informative or non-informative sample 
is selected from the finite population (stage II), with one 
sample per population. The two-stage selection scheme was 
repeated 1,000 times for each combination of sample size 
and type of informativeness. In order to compare our results 
with the ones obtained for the multilevel linear model, the 
experiment has been designed following the example of 
Pfeffermann et al. (1998, section 7). 




The simulation study focussed on a simple instance of 
the model defined in section 2, namely the random intercept 
probit binary model, which has only two categories for the 
response variable (i.e., K=2) and one cluster-level 
Gaussian random error. To parallel the study of 
Pfeffermann et al. (1998) the main simulation plan refers to 
the model without covariates, but some additional 
simulations are conducted to assess the performance of the 
estimators in the model with one cluster-level covariate and 
one subject-level covariate. 

The values of the binary response variable $Y_{ij}$ were
generated using the following two-stage scheme, which
parallels that of Pfeffermann et al. (1998):


— Stage I. Finite population values $Y_{ij}$
($j = 1, \ldots, M$; $i = 1, \ldots, N_j$) were obtained by first
generating a value from the superpopulation latent
model $\tilde{Y}_{ij} = \beta + u_j + e_{ij}$, with $u_j \sim N(0, \omega^2)$ and
$e_{ij} \sim N(0, \sigma^2)$, and then setting $Y_{ij} = 0$ if $\tilde{Y}_{ij} \le 0$ or
$Y_{ij} = 1$ if $\tilde{Y}_{ij} > 0$ (recall that the binary model has
only one threshold which is set to zero to guarantee
identifiability). The latent model parameter values
employed in the simulation are $\beta = 0$, $\omega^2 = 0.2$ and
$\sigma^2 = 0.5$, so that the parameters estimable from the
binary model are $\beta_\sigma = \beta/\sigma = 0$ and $\omega_\sigma = \omega/\sigma = 0.632$
(see expression (3)). The hierarchical structure of the
population comprises $M = 300$ clusters, while the
cluster sizes $N_j$ were determined by $N_j = 75 \exp(\tilde{u}_j)$,
with $\tilde{u}_j$ generated from $N(0, \omega^2)$, truncated below
by $-1.5\omega$ and above by $1.5\omega$. As a result, in our
population $N_j$ lies in the range [38, 147] with mean
around 80.

— Stage II. Once the finite population values were 


obtained, we adopted one of the following sampling 
schemes: 


(a) Informative at both levels: first, $m$ clusters were
selected with probability proportional to a
‘measure of size’ $X_j$, i.e., $\pi_j = m X_j / \sum_{j=1}^{M} X_j$; the
measure $X_j$ was determined in the same way as $N_j$
but with $\tilde{u}_j$ replaced by $u_j$, the random effect at
level 2. The elementary units in the j-th sampled
cluster were then partitioned into two strata
according to whether $e_{ij} > 0$ or $e_{ij} \le 0$ and simple
random samples of sizes $0.25 n_j$ and $0.75 n_j$ were
selected from the respective strata. The sizes $n_j$
were either fixed, $n_j = n_0$, or proportional to $N_j$.


(b) Informative only at level 2: the scheme is the
same as the previous one, except that simple
random sampling was employed for the selection
of level 1 units within each sampled cluster.

(c) Non-informative: the scheme is the same as the
previous one, except that the size measure $X_j$
was set equal to $N_j$.
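The data-generating steps of Stage I and the informative selection of scheme (a) can be sketched as follows. This is an illustrative reconstruction, not the SAS macro code used for the study: cluster sizes are rounded to integers, the size measure $X_j$ is obtained by clipping $u_j$ to $\pm 1.5\omega$, and the actual selection of the $m$ clusters is omitted. These choices, and all function names, are assumptions.

```python
import math
import random

M, m = 300, 35
BETA, OMEGA2, SIGMA2 = 0.0, 0.2, 0.5
OMEGA, SIGMA = math.sqrt(OMEGA2), math.sqrt(SIGMA2)

def truncated_normal(rng, sd, bound):
    """Draw from N(0, sd^2) truncated to [-bound, bound] by rejection."""
    while True:
        x = rng.gauss(0.0, sd)
        if abs(x) <= bound:
            return x

def generate_population(rng):
    """Stage I: latent model, binary responses and cluster sizes."""
    clusters = []
    for _ in range(M):
        u_j = rng.gauss(0.0, OMEGA)
        size = round(75.0 * math.exp(truncated_normal(rng, OMEGA, 1.5 * OMEGA)))
        e = [rng.gauss(0.0, SIGMA) for _ in range(size)]
        y = [1 if BETA + u_j + e_ij > 0 else 0 for e_ij in e]
        clusters.append({"u": u_j, "e": e, "y": y})
    return clusters

def cluster_inclusion_probs(clusters):
    """pi_j = m X_j / sum X_j, with X_j built from u_j (clipping is an assumption)."""
    x = [75.0 * math.exp(max(-1.5 * OMEGA, min(1.5 * OMEGA, c["u"])))
         for c in clusters]
    total = sum(x)
    return [m * x_j / total for x_j in x]

def stratified_unit_sample(rng, cluster, n_j):
    """Scheme (a), level 1: 0.25 n_j units with e > 0 and the rest with e <= 0.
    Returns (unit index, w_i|j) pairs, with w_i|j = stratum size / stratum sample size."""
    pos = [i for i, e in enumerate(cluster["e"]) if e > 0]
    neg = [i for i, e in enumerate(cluster["e"]) if e <= 0]
    n_pos = min(len(pos), round(0.25 * n_j))
    n_neg = min(len(neg), n_j - n_pos)
    chosen = [(i, len(pos) / n_pos) for i in rng.sample(pos, n_pos)]
    chosen += [(i, len(neg) / n_neg) for i in rng.sample(neg, n_neg)]
    return chosen

rng = random.Random(2004)
population = generate_population(rng)
pi = cluster_inclusion_probs(population)
sample0 = stratified_unit_sample(rng, population[0], 9)
omega_sigma = OMEGA / SIGMA  # about 0.632
```

Because positive disturbances are undersampled within clusters, an unweighted analysis of such a sample tends to understate the proportion of positive responses.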


The simulation study included samples with $m = 35$
clusters and varying numbers of elementary units: large
samples with fixed size $n_j = n_0 = 38$ and proportional
allocation $n_j = 0.4N_j$, and small samples with fixed size
$n_j = n_0 = 9$ and proportional allocation $n_j = 0.1N_j$ (mean of
about 9).

The simulation study was carried out entirely within the 
SAS System (SAS Institute 1999), writing specific code 
with the macro language. The models were fitted with the 
NLMIXED procedure (see Appendix A), using 10-point 
adaptive Gaussian quadrature with a dual quasi-Newton 
algorithm, which reached convergence in a few iterations. 
As explained in Appendix A, to avoid gross rounding errors 
the level 2 weights were pre-multiplied by a factor 
k = 10,000 and the estimated covariance matrix was then 
multiplied by the same factor. 


4.2 Results 


The results of the simulations are shown in Tables 1 and
2. For each sampling design the behavior of the point
estimators of the intercept $\beta_\sigma$ and the second level standard
deviation $\omega_\sigma$ is summarized by the mean and standard
deviation of their Monte Carlo sampling distribution. The
point estimators under study are the standard maximum
likelihood unweighted estimator and the following three
weighted versions of it: a) cluster-level weighted: the
weights are only at level 2 (i.e., varying $w_j$'s and constant
$w_{i|j}$'s); b) unscaled fully weighted: the weights are at both
levels and the level 1 weights are unscaled; c) scaled fully
weighted: the weights are at both levels and the level 1
weights are scaled according to (10), i.e., ‘scaling method
2’ of Pfeffermann et al. (1998).

Our results are shown and discussed according to the 
following three scenarios: 1) Base scenario: the sampling 
design is non-informative. In this situation all the basic 
assumptions underlying the random intercept binary model 
are fulfilled, so this case can be assumed as a benchmark 
for judging the subsequent results. 2) Informative/ 
Unweighted scenario: the sampling design is informative, 
while the estimator is unweighted. In this situation the basic 
assumptions underlying the random intercept binary model 
are violated because of the informativeness of the design 
and no adjustment is used. 3) Informative/Weighted 
scenario: the sampling design is informative and the esti- 
mator is weighted. Also in this case the basic assumptions 
underlying the random intercept binary model are violated,
but the weights are introduced as a tentative adjustment for 
the bias of the standard estimator. 


4.2.1 Base Scenario 


When the sampling design is non-informative the 
standard maximum likelihood unweighted estimator is 
asymptotically unbiased (Tables 1 and 2: rows 9-12, 
column 1). However, for small samples ($n_j = 9$ and
$n_j = 0.1N_j$) there is an appreciable negative bias in the
estimation of $\omega_\sigma$.

If the weights are introduced when there is no need to 
adjust for the effect of the design (Tables 1 and 2: rows 9- 
12, columns 2-4), we face a slight increase in the variability 
of the estimators, which is more pronounced when the 
unscaled fully weighted estimator is used in small samples. 
Note that, still in small samples, the unscaled fully weighted 
estimator of $\omega_\sigma$ is upward biased.


4.2.2 Informative/Unweighted Scenario 


The informativeness of the sampling design produces 
biased and unstable estimates. The bias is still evident for 
large samples (Tables 1 and 2: rows 1-8, column 1). The 
conclusions are the same for both types of informative 
designs, though the bias tends to have a different sign. 
Moreover the informativeness of the design inflates the 
variability of the standard estimator with respect to the base 
scenario: in particular, when the design is informative at 
both levels the standard error of the estimator of $\beta_\sigma$ is
doubled. 


4.2.3 Informative/Weighted Scenario 


Estimation of $\beta_\sigma$.

The results in Table 1 show that, when the design is
informative, the weight-based adjustment is effective in
removing the bias in the estimation of $\beta_\sigma$.

Particularly, when the design is informative only at level 
2 (Table 1: rows 5-8, columns 2-4) and the weights are 
introduced only at this level (cluster-level weighted 
estimator), the bias in the estimation is corrected with no 
important increase in the sampling variance. The result is 
valid also for fully weighted estimators (unscaled or 
scaled). The bias correction works for small samples too. 

When the design is informative at both levels (Table 1:
rows 1-4, columns 2-4) and the weights are introduced at
both levels (fully weighted estimators), the bias in the
estimation of $\beta_\sigma$ is corrected. Moreover, the fully weighted
estimators have smaller sampling variance than the
unweighted counterpart, except for the unscaled version in
small samples. The scaled version is preferable especially
in small samples, since it yields an unbiased
estimator with a substantially lower sampling variance. It
should be noted that when the design is informative at both
levels, the cluster-level weighted estimator is worse than the
standard unweighted estimator.

Table 1
Simulation Means and Standard Deviations (in parentheses) of Point Estimators of the Intercept (true value 0, number of replicates 1,000)

                                                 Weighted estimators
Sampling design               Unweighted         Cluster-level      Unscaled fully     Scaled fully
                              estimator          weighted           weighted           weighted

Informative at both levels
  Fixed size $n_j = 38$       -0.120 (0.212)     -0.411 (0.202)      0.014 (0.193)      0.015 (0.188)
  Prop. size $n_j = 0.4N_j$   -0.163 (0.212)     -0.453 (0.200)      0.018 (0.190)      0.021 (0.183)
  Fixed size $n_j = 9$        -0.214 (0.204)     -0.512 (0.190)     -0.062 (0.258)      0.000 (0.185)
  Prop. size $n_j = 0.1N_j$   -0.164 (0.220)     -0.450 (0.209)     -0.074 (0.294)      0.008 (0.203)

Informative only at cluster level (level 2)
  Fixed size $n_j = 38$        0.281 (0.169)      0.018 (0.168)      0.017 (0.170)      0.017 (0.169)
  Prop. size $n_j = 0.4N_j$    0.274 (0.169)      0.014 (0.178)      0.014 (0.182)      0.014 (0.181)
  Fixed size $n_j = 9$         0.274 (0.187)      0.010 (0.195)      0.010 (0.212)      0.009 (0.196)
  Prop. size $n_j = 0.1N_j$    0.269 (0.179)      0.007 (0.179)      0.007 (0.203)      0.006 (0.182)

Non-informative
  Fixed size $n_j = 38$        0.000 (0.108)      0.000 (0.114)      0.001 (0.115)      0.001 (0.115)
  Prop. size $n_j = 0.4N_j$    0.003 (0.113)      0.004 (0.120)      0.003 (0.123)      0.003 (0.122)
  Fixed size $n_j = 9$        -0.007 (0.108)     -0.009 (0.115)     -0.010 (0.125)     -0.010 (0.117)
  Prop. size $n_j = 0.1N_j$   -0.002 (0.110)     -0.002 (0.114)     -0.004 (0.132)     -0.003 (0.117)


Table 2
Simulation Means and Standard Deviations (in parentheses) of Point Estimators of the Second Level Standard Deviation
(true value 0.632, number of replicates 1,000)

                                                 Weighted estimators
Sampling design               Unweighted         Cluster-level      Unscaled fully     Scaled fully
                              estimator          weighted           weighted           weighted

Informative at both levels
  Fixed size $n_j = 38$        0.671 (0.106)      0.638 (0.112)      0.637 (0.137)      0.604 (0.128)
  Prop. size $n_j = 0.4N_j$    0.673 (0.108)      0.636 (0.112)      0.645 (0.142)      0.592 (0.130)
  Fixed size $n_j = 9$         0.644 (0.145)      0.584 (0.172)      0.920 (0.289)      0.536 (0.222)
  Prop. size $n_j = 0.1N_j$    0.598 (0.164)      0.546 (0.183)      1.002 (0.317)      0.498 (0.242)

Informative only at cluster level (level 2)
  Fixed size $n_j = 38$        0.595 (0.100)      0.596 (0.110)      0.605 (0.111)      0.601 (0.111)
  Prop. size $n_j = 0.4N_j$    0.582 (0.096)      0.582 (0.115)      0.603 (0.113)      0.596 (0.113)
  Fixed size $n_j = 9$         0.547 (0.121)      0.548 (0.135)      0.671 (0.144)      0.563 (0.133)
  Prop. size $n_j = 0.1N_j$    0.538 (0.122)      0.535 (0.142)      0.696 (0.158)      Oot. 1359)

Non-informative
  Fixed size $n_j = 38$        0.611 (0.086)      0.612 (0.092)      0.621 (0.090)      0.617 (0.091)
  Prop. size $n_j = 0.4N_j$    0.609 (0.084)      0.606 (0.088)      0.626 (0.088)      0.618 (0.088)
  Fixed size $n_j = 9$         0.561 (0.105)      0.561 (0.112)      0.685 (0.119)      O57 (0-211)
  Prop. size $n_j = 0.1N_j$    0.551 (0.109)      0.546 (0.113)      0.703 (0.134)      0.559 (0.112)


Estimation of $\omega_\sigma$.

The results in Table 2, concerning $\omega_\sigma$, are more difficult
to interpret (Table 2: rows 1-8, columns 2-4). First note that
also in the base scenario the estimation of $\omega_\sigma$ is biased,
especially in small samples. Therefore the weight-based
adjustment should be judged as effective if it is able to
reproduce the same bias which is observed in the base
scenario. On these grounds the behavior of the scaled fully
weighted estimator is satisfactory in nearly all situations,
with the exception of the small samples when the design is
informative at both levels. In that case there is also a
non-negligible number of replications which yielded a zero
estimate for $\omega_\sigma$ (4.5% for the design with fixed size and 2%
for the design with proportional size). The unscaled fully
weighted estimator does not suffer from the problem of null
estimates, but, apart from having a larger variance than the
scaled version, tends to overestimate $\omega_\sigma$, showing a relative
bias of about 50% in small samples when the design is
informative at both levels. Note also that the scaled fully
weighted estimator outperforms the cluster-level weighted
estimator even when the design is informative only at level
2.

4.2.4 Additional Simulations Using the Model with Covariates


Some additional simulations were conducted to assess 
the performance of the scaled fully weighted estimator in 
the model with one cluster-level covariate and one subject- 
level covariate. The model is the same used in the main 
simulation plan, except for the inclusion of a covariate at 
each hierarchical level. For each covariate the values are 
generated from a standard Gaussian distribution, while the 
corresponding regression coefficient is fixed to 0.1. 

As shown by Tables 3 and 4, the scaled fully weighted 
estimator is effective in removing the bias induced by the 
informative design. Relative to the unweighted estimator 
the sampling variance is higher, especially for the subject-
level regression coefficient. Overall, the performance of the
weighted estimator is satisfactory. 


Table 3
Simulation Means and Standard Deviations (in parentheses) of
Point Estimators of the Regression Coefficient of the Subject-Level
Covariate (true value 0.1, number of replicates 1,000)

                              Non-informative    Informative at both levels
Sampling design               Unweighted         Unweighted         Scaled fully
                              estimator          estimator          weighted estimator

  Fixed size $n_j = 38$        0.101 (0.028)      0.117 (0.040)      0.098 (0.050)
  Prop. size $n_j = 0.4N_j$    0.099 (0.026)      0.117 (0.043)      0.098 (0.052)
  Fixed size $n_j = 9$         0.099 (0.055)      0.119 (0.083)      0.100 (0.104)
  Prop. size $n_j = 0.1N_j$    0.098 (0.056)      0.116 (0.089)      0.098 (0.107)


Table 4
Simulation Means and Standard Deviations (in parentheses) of
Point Estimators of the Regression Coefficient of the Cluster-Level
Covariate (true value 0.1, number of replicates 1,000)

                              Non-informative    Informative at both levels
Sampling design               Unweighted         Unweighted         Scaled fully
                              estimator          estimator          weighted estimator

  Fixed size $n_j = 38$        0.096 (0.119)      0.117 (0.130)      0.102 (0.142)
  Prop. size $n_j = 0.4N_j$    0.102 (0.110)      0.106 (0.133)      0.106 (0.142)
  Fixed size $n_j = 9$         0.094 (0.117)      0.116 (0.141)      0.105 (0.150)
  Prop. size $n_j = 0.1N_j$    0.094 (0.119)      0.115 (0.144)      0.095 (0.158)


4.2.5 General Remarks


Our simulations showed that the PML approach is, in 
most cases, a simple and effective strategy to deal with 
informative sampling designs. The only requirement is the 
knowledge of the inclusion probabilities at every stage of 
the sampling process (except when the informativeness 
does not concern all the levels). 

As for the regression parameters, the scaled version of 
the fully weighted estimator showed good performance in 
our simulations, achieving a low bias with a modest 
increase in the sampling variance (in some cases the 
variance even diminished). Even when weighting is 
superfluous, the loss of efficiency due to the inclusion of 
scaled weights is very low. 

While for the estimation of the regression parameters
weighting seems to be always effective, for the variance
component $\omega_\sigma$ attention should be paid to the sample size:
in fact, weighting leads to satisfactory results only when the
cluster size is large, i.e., when it allows a good representation
of the complex variance structure. However, the
sample size is crucial in the estimation of $\omega_\sigma$ also when all
the basic assumptions of the multilevel ordinal model are
satisfied.

The differences induced by the type of clusters in the 
sample, fixed or variable size, are minimal, with equal sized 
clusters leading to slightly better estimators; however, as 
already noted, the important differences are largely due to 
the average size of the clusters in the sample. 

The results of our simulation study confirm the findings 
of Pfeffermann et al. (1998) on the random intercept linear 
model: probability-weighted estimators are good for the 
intercept, while some relevant bias remains in the esti- 
mation of the variance components when the sample is 
small. As was to be expected, when passing from a linear to
a nonlinear model the performance of the estimators worsens
slightly, but the direction and importance of the bias in the
various cases are similar. The advantages of scaling are also
confirmed.

The rise in the sampling variance due to the inclusion of 
the weights often has a magnitude which is in line with the 
results of Pfeffermann et al. (1998), though in some cases 
we found a reduction in the sampling variance, notably for 
the intercept when the weights are scaled and the design is 
informative at both levels. An interesting difference with 
respect to Pfeffermann et al. (1998) is the role of scaling in 
reducing the sampling variance: in this respect, scaling 
seems to be more effective in the binary model than in the 
linear model. 

As already noted, the critical point in the random
intercept binary model is the estimation of the cluster-level
variance $\omega_\sigma$, which represents a difficult task also when the
design is non-informative. Using the threshold formulation
outlined in section 2, $\omega_\sigma$ is defined as $\omega/\sigma$, so estimation
of $\omega_\sigma$ involves the problems observed in the linear model
associated with estimation of the two variance components.
The simulations showed that the performance of the scaled
weighted estimator of $\omega_\sigma$ is not entirely satisfactory in the
case of small sample sizes. A possible way to improve the
performance of the estimator is the adoption of a different
scaling method. Korn and Graubard (2003) investigated the
issue of scaling in the context of the linear model and
warned that the scaling method adopted here (‘scaling
method 2’ of Pfeffermann et al. 1998) may be badly biased
under some designs, even if the sample size of clusters and
sample sizes within the clusters are large. To get an idea of
the extent of the bias we performed a short simulation study
under the unfavorable scenario outlined by Korn and
Graubard (2003), namely a simple random sample of
clusters whose population sizes are all equal, and a simple
random sample of individuals within each sampled cluster
that is of size $2m$ or $m/2$ for a fixed $m$, depending on
whether the observed variability of the individuals within
the clusters tends to be large or small, respectively. In this
case the scaled weights at subject level are all equal to 1, so
weighting becomes ineffective. As a consequence, in the
linear variance component model the within variance will
be biased high. To see how this behavior extends to the
random intercept binary model we simulated 1,000 datasets
with 80 clusters and cluster sizes of 36 or 9 depending on
whether the binomial variance of the responses of the
cluster is over or under the median, respectively. Under the
same superpopulation model as in the main simulations, the
simulation means (and standard deviations) are -0.003
(0.098) for $\beta_\sigma$ and 0.451 (0.144) for $\omega_\sigma$. The cluster-level
variance is heavily underestimated, though its value is not
so far from the worst case of the main simulations (0.498
under the informative design with $n_j = 0.1N_j$). Therefore, it
seems unlikely to encounter situations where the bias is
much greater than already shown by our simulations.
Obviously, if estimation of the variance components is of
primary interest it is important to improve the method, but
this requires further research.


4.2.6 Bootstrap Variance Estimation 


The estimated covariance matrix of the parameter estimates
obtained by inversion of the information matrix,
yielded by default by the NLMIXED procedure, is not
reliable when using the weighted estimators to adjust for an
informative design. For example, the estimated standard
error of the scaled fully weighted estimator under the design
informative at both levels with $n_j = 0.4N_j$ is 0.109 for $\beta_\sigma$
(compared with a Monte Carlo value of 0.183) and 0.089
for $\omega_\sigma$ (compared with a Monte Carlo value of 0.130). For
the other sample sizes similar downward biases arise, so
an alternative variance estimator should be devised.

The bootstrap procedure described in section 3.4 has 
been applied to estimate the sampling standard deviations 
of the weighted point estimators of $\beta_\sigma$ and $\omega_\sigma$. We limited
the analysis to the scaled fully weighted estimator and to 
designs that are informative at both levels. To save compu- 
tational resources we implemented a bootstrap procedure 
which omits the steps concerning the elementary units, i.e., 
only the clusters are resampled. This procedure is expected 
to produce sufficiently accurate results, given the low 
sampling fraction (35/300) of the clusters (see section 3.4). 
Each simulation comprises 1,000 replications. For every 
replication the values of the response variable are generated 
through the two-stage scheme described in section 4.1 and 
200 bootstrap samples are selected. Table 5 reports, for 
each parameter, the Monte Carlo standard error of the 
sampling distribution of the scaled weighted estimator on 
1,000 replications of the complex design (see Tables 1 and 
2), the corresponding average bootstrap estimate and the 
relative bias. 


Table 5
Simulation Standard Deviations of the Scaled Weighted
Point Estimators of the Intercept and of the Second Level Standard
Deviation and Corresponding Bootstrap Estimates
(with 200 Bootstrap Samples Each) for Designs Informative
at Both Levels (1,000 Replicates for Each Design)

                                        $\beta_\sigma$                     $\omega_\sigma$
Sampling design               Simul.   Boot.    Relative     Simul.   Boot.    Relative
(informative at both levels)  s.d.     estim.   error        s.d.     estim.   error

  Fixed size $n_j = 38$       0.185    0.175     -5.4%       0.124    0.106    -14.5%
  Prop. size $n_j = 0.4N_j$   0.183    0.173     -5.5%       0.140    0.129     -7.9%
  Fixed size $n_j = 9$        0.200    0.167    -16.5%       0.234    0.599    156.0%
  Prop. size $n_j = 0.1N_j$   0.195    0.173    -11.3%       0.247    0.538    117.8%
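The relative errors in Table 5 appear to be computed as (bootstrap estimate − simulation s.d.) / simulation s.d.; this formula is inferred from the tabulated values rather than stated in the text. A short sketch that recomputes them from the rounded table entries, with hypothetical names:

```python
def relative_error(boot_estimate, simulated_sd):
    """Relative error (in percent) of the bootstrap estimate of the s.d."""
    return 100.0 * (boot_estimate - simulated_sd) / simulated_sd

# (simulation s.d., bootstrap estimate, reported relative error in %)
table5 = {
    "beta, fixed 38":   (0.185, 0.175, -5.4),
    "beta, prop 0.4N":  (0.183, 0.173, -5.5),
    "beta, fixed 9":    (0.200, 0.167, -16.5),
    "beta, prop 0.1N":  (0.195, 0.173, -11.3),
    "omega, fixed 38":  (0.124, 0.106, -14.5),
    "omega, prop 0.4N": (0.140, 0.129, -7.9),
    "omega, fixed 9":   (0.234, 0.599, 156.0),
    "omega, prop 0.1N": (0.247, 0.538, 117.8),
}

recomputed = {name: relative_error(boot, sd)
              for name, (sd, boot, _) in table5.items()}
```

The recomputed values agree with the reported ones up to the rounding of the table entries.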


Due to the extremely long computational time, we 
limited our experiment to a specific bootstrap procedure 
based on only 200 bootstrap samples. Further work is 
needed to calibrate the number of bootstrap samples and to 
explore possible variants of the method. Nonetheless, the 
entries of Table 5 give some hints about the behavior of 
bootstrap estimators. 

The performance is better for the estimation of the
sampling standard deviation of the estimator of $\beta_\sigma$, rather
than of $\omega_\sigma$. Especially for $\omega_\sigma$ the sample size is the critical
factor: for small cluster sizes ($n_j = 9$ and $n_j = 0.1N_j$) the
bootstrap estimate is completely unreliable. On the contrary,
with large cluster sizes ($n_j = 38$ and $n_j = 0.4N_j$) the results
are quite good, since for both $\beta_\sigma$ and $\omega_\sigma$ the bootstrap
produces a slight underestimation of the true variance.
Note, however, that the bad performance of the variance
estimator for $\omega_\sigma$ is not as critical since Wald tests for
variance parameters are not generally recommended in
ordinary situations anyway.


5. FINAL REMARKS 


The wide use of multilevel ordinal and binary models in 
many fields of application has motivated our study on the 
effects of complex sampling designs on the fitting of such 
models. In the paper we showed, by means of simulations, 
the bias induced by a two-stage complex sampling design 
on the fitting of a simple random intercept binary model 
when the clusters and/or the subjects are selected with 
probabilities that depend on the model’s random terms. The 
simulation study also showed that in such situations the bias 
can be reduced in an effective manner by the probability- 
weighted estimation procedure (PML) described in the 
paper, which is easily implemented in the SAS environ- 
ment. In particular, the scaled version of the weighted 
estimator achieved, for both fixed and random parameters, 
a low bias with a modest increase in the sampling variance. 
Even when weighting is superfluous, the loss of efficiency 
due to the inclusion of scaled weights seems to be very low. 

The application of the proposed methodology to real life 
examples requires an operational strategy which depends on 
the extent of the available information on the sampling 
design. Two extreme cases can be envisaged: a) for each 
stage of the sampling plan, the probabilities of inclusion 
and the adjustments for poststratification and nonresponse 
are exactly known; b) the information is limited to the final 
overall weights, which also include adjustments for post- 
stratification and nonresponse. 

In case a) the weights can be calculated at each sampling 
stage as the reciprocals of the product of sample selection 
probabilities and response probabilities given the sample 
selection, with a further correction for a possible poststra- 
tification. This is the idea behind the real life application 
presented in Pfeffermann et al. (1998).
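
As an illustration of case a), the weight at a given stage can be sketched as the reciprocal of the product of the selection probability and the response probability, optionally multiplied by a poststratification factor. The function name, argument names and numerical values below are hypothetical and serve only to show the arithmetic.

```python
def stage_weight(selection_prob, response_prob=1.0, poststrat_factor=1.0):
    """Weight for one sampling stage: reciprocal of (selection prob. x response prob.),
    with an optional multiplicative poststratification correction."""
    if not (0.0 < selection_prob <= 1.0 and 0.0 < response_prob <= 1.0):
        raise ValueError("probabilities must lie in (0, 1]")
    return poststrat_factor / (selection_prob * response_prob)

# Hypothetical example: a cluster selected with probability 0.2 that responds
# with probability 0.5, and a subject selected within it with probability 0.25.
cluster_weight = stage_weight(0.2, response_prob=0.5)   # level 2 weight
subject_weight = stage_weight(0.25)                      # level 1 (conditional) weight
print(cluster_weight, subject_weight)
```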

In case b) the lack of information is critical, since, even
in the absence of nonresponse and poststratification, it is 
not possible to disentangle the cluster-level and the 
(conditional) subject-level weights, at least without strong 
assumptions. As a result, weighted estimation cannot be 
performed. 

Between the two extreme cases just outlined there are
many possible intermediate situations which require ad hoc
solutions. For example, a common case arises when the
researcher has access to the cluster-level inclusion probabilities
($\pi_j$) and to the final overall subject-level weights
($w_{ij}$), which also include adjustments for poststratification
and nonresponse. When the poststratification and
nonresponse affect only the subject level, then the
subject-level (conditional) weights can be calculated as
$w_{i|j} = w_{ij} \cdot \pi_j$. Another more complex situation is described
by Korn and Graubard (2003).
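
The intermediate case just described amounts to multiplying each overall subject-level weight by the inclusion probability of its cluster. A minimal Python sketch follows; the function name and the numbers are illustrative assumptions, not taken from the paper.

```python
def conditional_subject_weights(overall_weights, cluster_inclusion_prob):
    """Recover subject-level conditional weights for one cluster:
    conditional weight = overall weight x cluster inclusion probability."""
    if not 0.0 < cluster_inclusion_prob <= 1.0:
        raise ValueError("cluster inclusion probability must lie in (0, 1]")
    return [w * cluster_inclusion_prob for w in overall_weights]

# Hypothetical cluster with inclusion probability 0.25 and two sampled subjects
# whose final overall weights are 40 and 50.
print(conditional_subject_weights([40.0, 50.0], 0.25))  # [10.0, 12.5]
```

The sketch is only valid under the stated condition that poststratification and nonresponse adjustments act at the subject level alone.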

A drawback of probability-weighted estimation is the 
need for special procedures to estimate the variability of the 
estimators. In our application we adopted a bootstrap 
technique, which is conceptually simple and easy to 
program, but requires some computational effort. Our 
limited simulation study suggests that its performance is 
good only for large sample cluster sizes; however more 
simulations would be needed to fully understand the 
behavior of the bootstrap estimator. 

Another open question is the choice of the most effective 
scaling method for reducing the bias of the estimator of the 
variance components when the sample size is small. 

The PML approach described in the paper is entirely
general, and the estimation technique based on the
NLMIXED procedure of SAS is easy to generalize to other
nonlinear models. It would therefore be of interest to assess
the performance of the method in models other than the
random intercept binary model considered here.


APPENDIX A 


We report the SAS code used for implementing the 
probability-weighted (PML) estimators described in the 
paper. The essential part of the code is the NLMIXED 
procedure of SAS, which is a general procedure for fitting 
nonlinear random effects models using adaptive Gaussian 
quadrature. Though the NLMIXED procedure does not 
include an option for PML estimation, it is still possible to 
insert the weights in the likelihood, using different tricks for 
level 1 and level 2 weights. To insert level 1 weights it is
necessary to exploit the option that allows one to write down
the expression for the conditional likelihood of the model:
one then simply translates into SAS programming
statements the expression $w_{i|j} \log L_{ij}(\theta \mid u_j)$ (see section
3.1). On the other hand, level 2 weights can be inserted in
the likelihood through the replicate statement. 
Unfortunately, this statement is limited to integer weights, 
so to avoid gross approximations it is advisable to proceed 
as follows: a) inflate all the level 2 weights by an arbitrary 
constant k (equal to 10,000 in our application); b) insert the 
integer part of the inflated weights in the likelihood through 
the replicate statement; c) multiply the estimated 
covariance matrix by k by means of the cfactor option. 
This trick relies on the fact that multiplying the level 2 
weights by a constant has the only effect of inflating the 
information matrix by that constant, leaving the estimates 
unchanged. In any case, when using the weighted estimation
method to adjust for an informative design, the estimated
covariance matrix of the parameter estimates is not reliable.
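
The effect of the inflation trick can be checked on a toy problem. The Python sketch below is not the NLMIXED procedure: it uses a weighted Bernoulli likelihood (an assumption chosen because its estimate and information have closed forms) to show that inflating the weights by k and keeping the integer part leaves the estimate almost unchanged, while the model-based variance shrinks by a factor of about k and is restored by multiplying by k.

```python
import math

def inflate_to_integer(weights, k=10_000):
    """Steps a) and b): inflate the weights by k and keep the integer part."""
    return [math.floor(w * k) for w in weights]

def weighted_bernoulli_fit(y, weights):
    """Weighted ML estimate of a Bernoulli probability and its model-based variance."""
    total = sum(weights)
    p_hat = sum(w * yi for w, yi in zip(weights, y)) / total
    information = total / (p_hat * (1.0 - p_hat))
    return p_hat, 1.0 / information

y = [1, 0, 1, 0]
w = [1.25, 0.875, 2.5, 1.0 / 3.0]       # hypothetical non-integer weights
k = 10_000

p_orig, var_orig = weighted_bernoulli_fit(y, w)
p_infl, var_infl = weighted_bernoulli_fit(y, inflate_to_integer(w, k))

print(p_orig, p_infl)                   # nearly identical estimates
print(var_orig, var_infl * k)           # step c): rescaling by k restores the variance
```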

The SAS code is reported below; comments are enclosed
between the symbols /* and */:


/* Option values, starting values and the weight variable names
(w1, w2) are partly illegible in the source; those shown here
are assumed readings */
proc nlmixed data=dataname qpoints=10
cfactor=10000;
/* cfactor is a constant multiplying the
estimated covariance matrix of the parameter
estimates */

parms b0=0 sd=0.5; /* starting values */
bounds sd >= 0;

eta=b0+randeff*sd;

if (yobs=1) then z=probnorm(eta);
else if (yobs=0) then z=1-probnorm(eta);

if (z > 1e-8) then ll=log(z); else ll=-1e100;
/* to avoid numerical problems if z becomes
too small */

ll=w1*ll; /* inclusion of level 1 weights */

model yobs ~ general(ll);
random randeff ~ normal(0,1) subject=j;
/* j is the cluster identifier */

replicate w2; /* inclusion of level 2
weights (only integers) */
ods output ParameterEstimates=pe
ConvergenceStatus=cs;
run;


Longitudinal Analysis of Labour Force Survey Data


GEOFF ROWE and HUAN NGUYEN¹


ABSTRACT 


The Canadian Labour Force Survey (LFS) was not designed to be a longitudinal survey. However, given that respondent 
households typically remain in the sample for six consecutive months, it is possible to reconstruct six-month fragments of 
longitudinal data from the monthly records of household members. Such longitudinal micro-data — altogether consisting of 
millions of person-months of individual and family level data — is useful for analyses of monthly labour market dynamics 
over relatively long periods of time, 25 years and more. 


We make use of these data to estimate hazard functions describing transitions among the labour market states: self- 
employed, paid employee and not employed. Data on job tenure, for employed respondents, and on the date last worked, 
for those not employed — together with the date of survey responses — allow the construction of models that include 
terms reflecting seasonality and macro-economic cycles as well as the duration dependence of each type of transition. In 
addition, the LFS data permits spouse labour market activity and family composition variables to be included in the 
hazard models as time-varying covariates. The estimated hazard equations have been incorporated in the LifePaths 
microsimulation model. In that setting, the equations have been used to simulate lifetime employment activity from past, 
present and future birth cohorts. Simulation results have been validated by comparison with the age profiles of LFS 
employment/population ratios for the period 1976 to 2001.

KEY WORDS: Microsimulation; Censoring; Truncation; Employment dynamics.


1. INTRODUCTION 


In recent years, there has been increased recognition of 
the importance of studying labour market dynamics using 
individual level (micro-) data. For this purpose, new panel 
surveys have been developed, for example, the Survey of 
Income and Labour Dynamics (SLID) (Statistics Canada 
1998). However, existing LFS data (Statistics Canada 2002) pro-
vide a virtually untapped historical resource, in the form of
many fragmentary event histories. From a conventional
standpoint, the data currently comprise a time series of
more than 300 cross-sectional surveys that were conducted 
monthly over more than 25 years. However, from a 
longitudinal perspective, those same data consist of about 
6.5 million fragmentary event histories covering over- 
lapping time intervals within the past quarter century and 
totalling over 34 million person-months of observation. 

The analysis referred to in this paper was specifically 
directed towards development of hazard models to be 
incorporated in LifePaths (Statistics Canada 2001), a
micro-simulation model of the Canadian population.
Further details on the LifePaths model are available from 
the Statistics Canada website at www.statcan.ca/english/ 
spsd/index.htm. 

The paper is organized in the following way. In section 2, 
we discuss some features of LFS data when reorganized as 
longitudinal records and we present three examples com- 
paring estimates derived from the resulting longitudinal file
with corresponding estimates from other sources. In section
3, we focus on the use of the data to model employment 
activity for LifePaths. There, we discuss the use of LFS 
micro-data in estimating hazard equations that describe 
employment dynamics. Finally, we present some illus- 
trations of estimation results and a validation of LifePaths 
simulations that make use of the hazard equations. 


2. LONGITUDINAL LFS DATA:
DISTINGUISHING FEATURES AND 
PROOF-OF-CONCEPT 


A longitudinal version of the LFS data was constructed 
by concatenating the monthly records of individual 
respondents into a file containing one record per respondent. 
Since an LFS respondent normally remains in the LFS 
sample for six consecutive months, we can obtain six-month 
histories for most respondents. These histories are not, by 
themselves, long enough for most longitudinal analyses. 
However, given the overlapping rotation groups that are part 
of the LFS design, these six-month fragments may be used 
in analysis of the experiences of employment cohorts over 
decades. (In line with the focus of the analysis below, we 
use the term “cohort” to refer to a relatively homogeneous 
group for all of whom a specified initial event has occurred. 
Thus, an “employment cohort” might refer to all persons
who started a new job within a specified time period or,
more narrowly, to all of those who started their third job
within a specified time period. The data available from the
LFS determine how narrowly such a cohort can be defined
here.)

Figure 1, which illustrates some characteristics of the 
LFS data after they are formed into longitudinal records, 
focuses on changes in employment status for the employ- 
ment cohort who started a job in January 1976. Respon- 
dents who were members of this cohort and who entered the 
sample through rotation 1 contribute data on the first six 
months, from January 1976, when the job started, to June 
1976, when they left the LFS sample. For respondents from 
rotation 2, the six-month longitudinal data window shifts 
right one month (starting and ending one month later than 
those given by rotation 1). The overlapping data windows of 
respondents from subsequent rotations evolve similarly. 
Thus, the longitudinal LFS data can be seen as a combi- 
nation of overlapping sets of panel data, in which re- 
spondents from the same rotation constitute a conventional 
data panel. 

Successive six-month fragments of longitudinal LFS data 
can be combined to provide successive estimates of 
cumulative attrition from an initial employment cohort and, 
further, to identify new cohorts defined in terms either of a 
new job or of a period without employment. Thus, over the 
long term (currently up to 25 years), many different samples 
of individuals can contribute information about the same 
employment cohort observed at different points in time. 




Even so, month-to-month changes are observed largely 
from the same sample of individuals. The two shaded areas 
in Figure 1 illustrate this. The respondents from each of the
rotations 2-5 contribute data for both the May-June and the 
June-July intervals. 
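
The overlap of the six-month windows can be verified with a short Python sketch. It assumes, as in the description of Figure 1, that rotation 1 enters in January 1976 and each later rotation enters one month after the previous one; the function names are illustrative.

```python
MONTHS_1976 = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def rotation_window(rotation, length=6):
    """Month indices (0 = January 1976) observed for a given rotation group."""
    start = rotation - 1   # rotation 1 enters in January, rotation 2 in February, ...
    return list(range(start, start + length))

def observes_interval(rotation, first_month):
    """True if both months of the interval first_month -> first_month + 1 are observed."""
    window = rotation_window(rotation)
    return first_month in window and (first_month + 1) in window

MAY, JUN = MONTHS_1976.index("May"), MONTHS_1976.index("Jun")
both = [r for r in range(1, 7)
        if observes_interval(r, MAY) and observes_interval(r, JUN)]
print(both)  # [2, 3, 4, 5]
```

Rotation 1 leaves the sample after June and rotation 6 enters only in June, so neither contributes to both intervals.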

This is not the first attempt to use LFS data longitu- 
dinally. Stasny (1986) and Lemaitre (1988) studied errors in 
the estimation of “gross flows” between labour force states 
(employed, unemployed and not in the labour force) over 
intervals of one month. Lemaitre found that problems arose 
both because of response errors and because “Labour Force 
Survey concepts, designed for cross-sectional purposes, tend 
to ‘create’ flows when consecutive months’ responses are
linked”. (Examples include the treatment of on-call workers
and of the self-employed without a business.) Nevertheless,
he concluded, “Administrative data have shown that not all
sub-groups of status changers are seriously overestimated”.
Kinack (1991) examined the longitudinal consistency of 
responses to questions on job search activity that were used 
to distinguish between the categories unemployed and not in 
the labour force. He found substantial inconsistency, 
particularly when associated with proxy responses from 
different proxy respondents. These studies have shown that 
focusing on transitions between the categories employed and 
not employed (i.e., without distinguishing between 
unemployed and not in the labour force) could help reduce 
the impact of response error. 


[Figure 1: timeline diagram. Rows: respondents from rotations 1 to 6. Horizontal axis: months from Jan-76 to Oct-76. Legend: jobs started in January 1976; subsequent jobs; not employed. Numbers show job tenure (months). Arrows indicate the status continues at time of exiting the LFS sample.]

Figure 1. Illustration of LFS fragmentary data on cohort starting jobs in January 1976

Cross-sectional LFS data have previously been used to
estimate frequencies of job hiring and job separation over 
monthly intervals (Lemaitre, Picot and Murray 1992). In 
that case, hiring was directly observed from the frequency 
of reported job-tenures of one month or less, while sepa- 
ration was determined residually using aggregate estimates 
of employment change together with the estimates of hiring. 
Cross-sectional LFS data have also been used to calculate 
and compare duration statistics for synthetic-cohorts. For 
example, Corak and Heisz (1995) use retention rates from a 
single time interval to represent a hypothetical cohort’s 
experience. Synthetic-cohort retention rates were obtained 
using the numbers of employed LFS respondents reporting
job tenure “t” in month “m” together with those reporting
tenure “t + 1” the next month. Such uses of cross-sectional
data have certain limitations. In particular, because the 
movement of individuals is not directly observed, desti- 
nation states are unknown. (Although we may estimate the 
proportion that separated from a job, we cannot estimate the
proportion of those that became unemployed rather than 
dropping out of the labour force or beginning another job 
immediately). Nevertheless, a time series of synthetic-co- 
hort statistics — for example, the proportions of jobs that 
might last a certain duration — can serve as an index that is 
sensitive to changing labour market conditions. 
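
The synthetic-cohort retention rate described above can be sketched as the ratio of the number reporting tenure t + 1 in month m + 1 to the number reporting tenure t in month m. The Python example below uses invented counts and a hypothetical function name; it is meant only to make the calculation concrete.

```python
def retention_rates(counts_month_m, counts_next_month):
    """Synthetic-cohort retention rates by tenure.

    Each argument maps job tenure (months) to the number of employed
    respondents reporting that tenure in the given cross-section.
    """
    rates = {}
    for tenure, n_now in counts_month_m.items():
        n_next = counts_next_month.get(tenure + 1)
        if n_now > 0 and n_next is not None:
            rates[tenure] = n_next / n_now
    return rates

# Invented counts: 200 respondents report tenure 1 in month m and
# 160 report tenure 2 in month m + 1, and so on.
month_m = {1: 200, 2: 150}
month_m_plus_1 = {2: 160, 3: 135}
print(retention_rates(month_m, month_m_plus_1))
```

Because the two cross-sections contain largely different respondents, the ratio describes a hypothetical cohort and says nothing about destination states.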


2.1 Proof-of-Concept: Selected Examples of 
Longitudinal Data Validation 


The LFS data were not intended to be used longitudinally
and problems can arise with such use (Stasny 1986; 
Lemaitre 1988; Kinack 1991). Consequently, it is important 
to verify, for each analysis individually, that valid estimates 
can be obtained by month-to-month comparison of
longitudinal responses. We present three examples of the
verification of LFS longitudinal estimates below. In Figure 
2, we compare estimates of the annual number of job 
separations in Canada from 1976 to 1995 (separations of all 
types, permanent and temporary) based on LFS data and on
administrative data. The latter are based on Records of 
Employment (ROE) issued by employers at the time of job 
separation for Employment Insurance purposes (Statistics 
Canada 1998). 

As may be seen, the number of transitions determined by 
month-to-month comparison of LFS data corresponds 
closely to the number from ROE data. Still, there are dif- 
ferences between the two series. Some of these differences 
could arise because of differences in coverage between the 
LFS and administrative data, as well as periodic changes in 
the LFS design or questionnaire. Another source of dif- 
ference could arise because our counts based on LFS data 
neglect job separations of multiple job holders who 
remained employed in at least one job (i.e., we counted only 
main-job changes). Nevertheless, we regard the degree of 
agreement between the LFS and administrative data as close 
enough to justify further analysis of the LFS micro-data. 
Both data sources imply that the annual rate of job 
separations was high: based on ROE data between 1978 and 
1995, the average annual job separation rate for males was 
over 38 percent of annual person-jobs. Further analysis of 
the LFS micro-data can shed light on these dynamics. 

Figure 3 goes further in the validation of employment 
dynamics, comparing “job survival” probabilities for males 
and females who started a job in 1993, as estimated from the 
LFS data and from SLID. (Note that 1993 corresponds to 
the first year of SLID data). 

Figure 2. Estimates of Annual Job Separations

Figure 3. ‘Job Survival’ Probabilities of the Cohort Starting Jobs in 1993: Comparison of Estimates Based

Figure 4. Estimates of Births in Canada by Quarter, 1976 - 2001


The “job survival” probabilities were estimated from 
LFS data by the chained product of average retention rates 
derived from monthly main-job separation rates over the 
period 1993 to 1998. Survival probabilities from the SLID 
data were estimated in a similar manner using the reported 
job tenure and dates of job end. Both survival curves display 
the same characteristic shape, showing relatively high
attrition for jobs of duration less than a year, but with much 
lower attrition rates at job tenures of one to five years. There 
are discrepancies between the estimates for durations of 
about six months or less, which may be related to the one- 
year recall period of SLID interviews and to the restriction 
of LFS job-tenure data to main-jobs. However, over periods 
as long as five years, the LFS and SLID provide very 
similar estimates. And, with the available LFS data, we can 
track some employment cohorts for as long as 25 years after 
the employment spell began. 
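
The chained product mentioned above can be illustrated with a minimal Python sketch. It assumes that the retention rate for each month of tenure is one minus the monthly main-job separation rate; the function name and the example rates are hypothetical.

```python
def survival_curve(monthly_separation_rates):
    """Job survival probabilities as the chained product of monthly retention rates."""
    survival = []
    surviving = 1.0
    for rate in monthly_separation_rates:
        surviving *= (1.0 - rate)      # retention rate for this month of tenure
        survival.append(surviving)
    return survival

# Invented separation rates that are high early in a job and lower afterwards.
print(survival_curve([0.5, 0.5, 0.25]))
```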

A final illustration of effective longitudinal use of LFS 
data involves month-to-month comparison of the number of 


children aged less than one year as reported by female 
economic family heads or by the spouse of a male head. An
infant child that is newly reported by a woman aged
between 15 and 50 likely signifies the birth of a child. In 
order to make direct comparisons between these LFS 
estimates and vital statistics, we made some straight- 
forward adjustments to account for the proportion of births 
occurring to other women living in economic families (e.g., 
teen lone parents living with their parents) and for births in 
the Yukon, NWT and Nunavut. A comparison of the 
resulting LFS monthly estimates of births with the 
corresponding counts of births registered in vital statistics 
(Figure 4) demonstrates that the LFS estimates follow 
secular trends in fertility as well as capturing some of the 
month-to-month fluctuation in births. Taken together, these
three examples indicate that, with careful attention to
survey coverage, survey concepts and the possibility of
response error, the LFS can provide useful longitudinal
micro-data.


3. USING LONGITUDINAL LFS MICRO-DATA
FOR MODELING EMPLOYMENT ACTIVITY 
IN LIFEPATHS 


This section focuses on the use of the LFS data to 
simulate employment activity in LifePaths. Currently, 
LifePaths uses a 3-category classification of employment 
status: employee (E), self-employed (SE), and not em-
ployed (NE). We have not analyzed transitions involving
unemployment. (Unemployment is a complex state re- 
quiring additional questions to ascertain and so, as noted 
above, unemployment transitions are particularly subject to 
response error). 

There are six transitions that can result in a change in 
employment status (as represented in Figure 5). LifePaths 
models all of these transitions. In addition, job changes that 
do not appear to involve an interruption of employment are 
also modeled by LifePaths (denoted here as E => E). The 
LFS micro-data were used to estimate hazard equations for 
each of these seven transitions. The estimated coefficients of 
these equations became parameters in the LifePaths “Career 
Work” module. Below we discuss some technical issues 
that arise due to the limitations of the LFS data, followed by 
an illustration of the estimation results and then of a 
simulation outcome. 
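
The count of transitions follows directly from the three-state classification, as the short Python sketch below shows. The variable names are illustrative.

```python
from itertools import permutations

STATES = ("E", "SE", "NE")   # employee, self-employed, not employed

# Ordered pairs of distinct states: transitions that change employment status.
status_changes = list(permutations(STATES, 2))

# Job-to-job changes without an interruption of employment are modelled as well.
modelled_transitions = status_changes + [("E", "E")]

print(len(status_changes), len(modelled_transitions))  # 6 7
```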

The fragmentary nature of these data poses a challenge 
for analysis. An important question is whether there are 
unavoidable biases that result from their fragmentary nature. 
In general, the answer is that the limitations of these data 
can be accounted for and potential sources of bias can be 
avoided with careful analysis. 


3.1 Censoring and/or Truncation of Event Histories 


One source of concern for an analyst of these data is the 
absence of retrospective employment information other than
the length of the current employment spell. We might think
of individual employment histories as consisting of a 
(largely unobserved) succession of contingent employment 
states (illustrated in Figure 6) with transitions among these 
states reflecting the process of career development. Thus, 
given only the transitions observable within the LFS 
window, the transition rates that can be estimated will 
inevitably involve pooling data from respondents who have 
had markedly different prior careers. In contrast, panel 
surveys like SLID collect retrospective data at the first
interview that, although limited, at least permit some
experience rating of respondents in terms of previous 
extended work interruptions or periods of part-time work. 

Another concern, illustrated in Figure 6, is that LFS 
employment spell durations may be left-truncated and/or 
right-censored. Right-censoring refers to the circumstance in 
which a spell ceases to be observed or a respondent ceases 
to be at risk without a transition occurring of the type being 
studied. This happens either (1) because the respondent’s 
household “rotated out” of the LFS sample before any 
transition occurred, or (2) because another transition 
occurred that was not of the type under active study. 
Similarly, these data are frequently left-truncated. This 
refers to the circumstance in which the beginning of a spell 
is unobserved, because it happened before the respondent’s 
household “rotated in” to the LFS sample. (These data are 
left-truncated rather than left-censored, because respondents 
provide the information necessary to determine the elapsed 
duration of the current spell at the time of the first 
interview). Since both truncation and censoring are gen- 
erally independent of employment event processes, neither 
should lead to bias in the estimation of transition 
probabilities, if properly accounted for in the likelihood 
function. 


[Figure 5: diagram of the three employment states, Employee (E), Self-Employed (SE) and Not Employed (NE), and the transitions among them.]

Figure 5. Employment Status and Transitions in LifePaths

Figure 6. Recurrent Events and Employment Spell Durations Observable within the LFS Sample Window


The combination of full and partial information provided
by left-truncated and right-censored data can be represented
in a conditional likelihood (Wang 1991). In a competing
risks framework, the likelihood of an employment transition
type j involving respondent i may be expressed in terms of
the spell duration observed k months after i was first
observed to be at risk of transition j. Let $t_i$ denote the year
and month of the LFS interview in which i’s current
employment state was first observed (i.e., often the first
interview). Based on information collected at each
interview, we can determine the length of the current spell
of employment or spell not employed ($m_{t_i}$). Then
$m_{t_i+k} = m_{t_i} + k$ would denote the elapsed spell duration in
the state as assessed k months after the first observation,
assuming no intervening events, and the likelihood of a
transition of type j (i.e., $L_{j,t_i+k}$) can be expressed in terms of
$m_{t_i+k}$. Terms in the likelihood function comprise: the
probability density of durations leading up to transitions of
type j ($f_j(m_{t_i+k})$), the corresponding cumulative probability
($F_j(m_{t_i+k})$), a binary variable indicating whether or not
censoring has occurred ($C_{j,t_i+k}$), and a further binary variable
indicating whether or not the current spell was left-truncated
($LT_{t_i}$). Note that, in the competing risks framework,
the density $f_j(m_{t_i+k})$ relates to a latent variable (the
waiting time leading specifically to transition j) and that we
must assume there is one such density for each competing
event. In principle, the completed spell duration (observed
when a transition occurs) will correspond to the minimum
of competing, latent waiting times.

To account for left truncation, the likelihood is expressed
in terms of conditional probabilities given the spell duration
first observed ($m_{t_i}$): these probabilities take the form either
of conditional probabilities evaluated at the time of an
observed transition ($f_j(m_{t_i+k} \mid m_{t_i})$) or of conditional
probabilities of surviving (without the occurrence
specifically of transition j) to the observed duration

This likelihood accounts for all of the information we
have regarding the specific risk of transition j and can
incorporate the effect of other competing risks by treating
them as censoring events that are in addition to censorship
by “rotating out” of the sample. Competing risks problems
are commonly formulated in terms of such latent waiting 
times, especially in epidemiology and biostatistics, but also 
in economics (e.g., Heckman and Honoré 1989). However, 
while providing a mathematically convenient motivation for 
the likelihood, the approach has been criticized “on the basis 
of unwarranted assumptions, lack of physical interpretation 
and identifiability problems” (Prentice, Kalbfleisch, 
Peterson, Flournoy, Farewell and Breslow 1978). 
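
For orientation, a textbook form of the contribution of one left-truncated, right-censored spell to a conditional likelihood is sketched below. It is written under assumed notation and may differ from the authors' equation (1): $f_j$ and $F_j$ are the density and distribution function of the latent waiting time to transition j, $m_{t_i}$ is the spell duration at first observation, $m_{t_i+k}$ the duration k months later, $C_{j,t_i+k} = 1$ if the spell is censored, and $LT_{t_i} = 1$ if the spell is left-truncated.

```latex
% Sketch only: standard conditional likelihood contribution under left
% truncation and right censoring; not necessarily the authors' equation (1).
\[
L_{j,t_i+k} =
  \left[ \frac{f_j(m_{t_i+k})}{\{1 - F_j(m_{t_i})\}^{LT_{t_i}}} \right]^{1 - C_{j,t_i+k}}
  \left[ \frac{1 - F_j(m_{t_i+k})}{\{1 - F_j(m_{t_i})\}^{LT_{t_i}}} \right]^{C_{j,t_i+k}}
\]
```

When the spell is observed from its start ($LT_{t_i} = 0$) the denominators equal one and the expression reduces to the usual right-censored likelihood.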

The conditional likelihood (1) can be approximated by a
Poisson likelihood (Holford 1980; Laird and Olivier 1981),
thereby also acknowledging the discreteness of the data (i.e.,
transitions are generally “observed” in the one month
interval between successive interviews). Equation (1) can be
re-expressed in terms of a binary variable ($Y_{j,t_i+k}$) that
represents occurrence or non-occurrence of a transition in a
particular time interval (note that $Y_{j,t_i+k} = 1 - C_{j,t_i+k}$).
Then, $Y_{j,t_i+k}$ is treated as a Poisson random variable having
an expected value equal to the hazard $h_{j,t_i+k}$, which is
assumed piecewise constant. Under this model, the contribution
from i to the log-likelihood over n periods (using
$h_{j,t_i+k} = f_j(m_{t_i+k}) / (1 - F_j(m_{t_i+k})) = -\partial \ln(1 - F_j(m_{t_i+k})) / \partial m$,
together with (1)) is approximately:

It is common practice to account for a complex survey
design by means of a “pseudo” likelihood that incorporates
the survey weight. Maximizing the “pseudo” likelihood
corresponds to minimization of a weighted sum of deviance
terms (i.e., terms representing the difference between
estimated likelihood contributions and their maximum
possible values). Thus, the full-sample, conditional log-
likelihood for transition j may be transformed into a
weighted deviance $D_j$ (note that W is derived from the
survey weights and, since transitions are typically identified
by comparing employment states between interviews, we
use averages of consecutive cross-sectional survey weights
to obtain W):

In the analysis of each transition type j, we treat other
events (i.e., non-j events occurring to the same population- 
at-risk) as censoring, and so the deviance for a set of such 
events will be the sum of component deviances (i.e., if the 
overall hazard is the sum of competing hazards, then the 
competing risks may be treated as independent (Prentice 
et al. 1978)). 
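
A weighted Poisson deviance of the kind described can be sketched in Python as follows. This is the standard Poisson deviance for binary person-month outcomes, offered as an assumption about the general form rather than as the authors' equation (3); the function names and the example records are hypothetical.

```python
import math

def transition_weight(weight_before, weight_after):
    """Weight for a month-to-month comparison: the average of the two
    consecutive cross-sectional survey weights."""
    return 0.5 * (weight_before + weight_after)

def weighted_poisson_deviance(records):
    """Weighted Poisson deviance over person-months.

    records: iterable of (y, hazard, weight), where y is 1 if the transition
    occurred in the month and 0 otherwise, and hazard is the fitted value.
    """
    total = 0.0
    for y, hazard, weight in records:
        # First component: non-zero only at observed transitions.
        log_term = y * math.log(y / hazard) if y > 0 else 0.0
        # Second component: difference between the event and the hazard.
        total += 2.0 * weight * (log_term - (y - hazard))
    return total

w = transition_weight(100.0, 120.0)
print(w, weighted_poisson_deviance([(0, 0.1, w), (1, 0.1, w)]))
```

Summing such deviances over the competing transition types corresponds to treating the other events as censoring for each type in turn.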

A more direct motivation of the same deviance takes 
Poisson processes as its starting point (Borgan 1984; 
Andersen 1985; Andersen and Borgan 1985; Lawless 
1987), rather than starting with postulated event-specific, 
latent, duration densities like $f_j(m_{t_i+k})$. In this case, we
can model sampled multivariate counting processes that 
represent the number of occurrences of each specific 
transition in time intervals $[0, t]$. Sample counting
processes, represented by the step functions in Figure 6, are 
observable counterparts of cumulative hazard functions. The 
assumption that the underlying hazard functions are ap- 
proximately piecewise constant leads directly to the Poisson 
deviance as an approximation (Lindsey 1995). To limit bias, 
the principal concerns are that the population-at-risk can be 
identified, that censoring or truncation mechanisms are 
conditionally independent of the underlying employment 
processes and that the intervals over which hazards are 
assumed constant are not too large. 
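
Under a piecewise-constant hazard without covariates, the maximum likelihood estimate in each duration interval is the number of events divided by the time at risk. The Python sketch below applies this occurrence/exposure calculation to left-truncated, right-censored spells; the function name, the unweighted form and the example spells are assumptions made for illustration.

```python
def piecewise_hazards(spells, cut_points):
    """Occurrence/exposure hazard estimates by duration interval.

    spells: (entry_duration, exit_duration, event) tuples in months, where
        entry_duration is the elapsed spell duration when first observed
        (0 if observed from the start) and event is 1 if the transition of
        interest ended the spell, 0 if the spell was censored.
    cut_points: increasing lower bounds of the duration intervals; the last
        interval is open-ended.
    """
    n = len(cut_points)
    events = [0.0] * n
    exposure = [0.0] * n
    for entry, exit_, event in spells:
        for idx, lower in enumerate(cut_points):
            upper = cut_points[idx + 1] if idx + 1 < n else float("inf")
            # Time at risk is counted only while the spell is under observation.
            overlap = min(exit_, upper) - max(entry, lower)
            if overlap > 0:
                exposure[idx] += overlap
            if event and lower < exit_ <= upper:
                events[idx] += 1
    return [e / x if x > 0 else None for e, x in zip(events, exposure)]

# Three hypothetical six-month fragments; the second and third are left-truncated.
spells = [(0, 6, 1), (10, 16, 0), (8, 14, 1)]
print(piecewise_hazards(spells, [0, 12]))
```

Each fragment contributes exposure only over the durations at which it was actually observed, which is how left truncation enters the estimate.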

It is possible to obtain simple averaged estimates of 
employment hazard functions (such as those displayed in 
Figure 3) by implicitly splicing together all available 
information on members of a defined cohort from the
longitudinal LFS samples (that is, maximizing likelihood
(1), but without considering any covariates). Making
allowance for censoring and truncation in this way is a 
relatively simple example of such problems compared with 
the more complex observation schemes considered by 
Alioum and Commenges (1996). This implicit splicing of 
information is apparent in the deviance (3) which has two 
components: the first component is non-zero only at 
observed transitions, while the second component reflects 
the weighted differences between cumulative events and 
cumulative hazards (accumulated over all durations prior to 
the events or to censoring times). To the extent that the LFS 
cross-sections are representative samples for each reference 
week, then — taken together — they will provide an accurate 
estimate of the numbers of events occurring over the “life” 
of an employment cohort. Similarly, within samples from 
employment cohorts, we can expect to find left-truncated 
and right-censored respondent spells that might fill-in the 
missing prior histories of those left-truncated spells that 
terminate with a transition. As such, the first component of 
the deviance will accurately reflect whether hazard 
estimates tend to be large over periods where observed 
events are frequent. And the second component, summed 
over all respondent-months, may have a value similar to that 
which we might have obtained had there been no left- 
truncation. So, for data as extensive as these, the conditional 
likelihood may be almost equivalent to an unconditional 
likelihood. 
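
The two-component structure described above can be illustrated with a short Python sketch. It assumes that deviance (3) takes the standard Poisson form for piecewise-constant hazards, with the expected number of events in each respondent-month equal to hazard times exposure. The function name, the argument layout and the optional weights are illustrative assumptions, not the LifePaths implementation.

```python
import math


def poisson_deviance(events, hazards, exposures, weights=None):
    """Poisson deviance for piecewise-constant hazards (illustrative sketch).

    events    : observed transitions (0 or 1) per respondent-month
    hazards   : assumed constant hazard within each respondent-month
    exposures : time at risk within each respondent-month
    weights   : optional survey weights (assumed to be 1 if omitted)
    """
    if weights is None:
        weights = [1.0] * len(events)

    transition_part = 0.0   # non-zero only where a transition is observed
    cumulative_part = 0.0   # weighted events minus cumulative hazards
    for d, h, t, w in zip(events, hazards, exposures, weights):
        mu = h * t  # expected events under the piecewise-constant hazard
        if d > 0:
            transition_part += w * d * math.log(d / mu)
        cumulative_part += w * (d - mu)
    return 2.0 * (transition_part - cumulative_part)
```

Under this form, a respondent-month without a transition contributes only through the second component, which is why left-truncated and right-censored spells still carry information about the hazard.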


3.2 Estimating Employment Transition 
Hazard Equations 


Patterns of employment transition differ significantly 
among different demographic groups. For example, full- 
time students are most active in the labour market during 
their summer break, whereas the maternity leave that an 
employed pregnant woman takes may be largely deter- 
mined by Employment Insurance regulations. Accordingly, 
LifePaths distinguishes among the following groups and 
models their employment activities separately: 

- Those who are full-time students;
- Those who have just graduated or left school and are in a transition to an after-school job;
- Pregnant women for whom a maternity leave may apply;
- Those who are in prime ages of employment; and
- Older workers in transition to retirement.


We discuss here only the estimation for the fourth group, 
comprising individuals who are in what is referred to in 
LifePaths as their “career employment” phase (the most 
important phase in terms of impact on the economy). 
Particulars for the other groups are available from the 
Statistics Canada website noted above.

For implementation in LifePaths, our hazard model uses
a log-linear form of regression equation — one equation for 
each of the 7 transitions and for each sex separately, giving 
a total of 14 equations: 


$$E(Y \mid m, X) = \exp\{ g(m) + X\beta \} \qquad (4)$$
where $E(\cdot)$ is the expectation operator, $g(m)$ is a log-linear
spell duration spline, $X$ is a vector of time-varying
covariates and $\beta$ is a vector of regression coefficients. The
term g(m) corresponds to a piecewise Weibull baseline 
hazard, which, in our specification, distinguishes employ- 
ment transition risks at durations of less than a year from 
risks at durations of more than a year. The covariates, X, 
include variables representing individual age, education, 
province of residence, presence of children by age group, 
spouse’s employment status, calendar month and calendar
year, as well as interactions among some of these factors.
Final estimates of $\beta$ and $g(m)$ minimize the deviance (3).
The only example of detailed results that we present here 
involves the mutual influence of husband’s and wife’s 
employment status on each other’s respective transition 
hazards. Figure 7 compares coefficient estimates from the 
seven equations that correspond to the seven transitions we 
specified. The two panels correspond to the separate sets of 
equations for males and females. The category “no spouse 
present” was treated as the reference category and the 
spouse’s employment status was classified into “with paid 
employment”, “self-employed”, and “not employed”. The
estimated coefficients are presented here in terms of risk 
relative to the reference group. Thus, with other covariates 
controlled, the hazard of becoming self-employed for 
female employees whose husbands are self-employed is 
about 2.5 times higher than the hazard of their counterparts 
who do not have a spouse (see tallest bar in the top panel).

Figure 7 shows that the very presence of a spouse can
work in opposite directions for males and females. The most 
frequent transitions for both sexes are E => E, NE => E and 
E => NE. For females, the first two of those transitions are 
less likely to occur to married women than to single women, 
while the transition to “not employed” is more likely. (The 
presence of children is not the reason for this, as their 
presence is accounted for by other terms in the equation). 
For males, the pattern is reversed. Thus, these results appear 
consistent with conventional gender roles. However, the
magnitudes of these relative risks do not suggest that
gender roles have a particularly strong influence once the
influence of other variables is accounted for.

Figure 7 reveals another conspicuous pattern. First, the 
relative risks of a transition into self-employment, for 
spouses with husbands/wives in self-employment, stand out 
as the highest among all transitions. In addition,
spouses with husbands/wives in self-employment have the 
lowest relative risks of a transition out of self-employment. 
Thus, self-employment status seems to be mutually re- 
inforcing within families. These observations are consistent 
with forms of joint self-employment involving a family 
business (e.g., a corner store) or involving endogamy among 
professionals (e.g., lawyers marrying other lawyers). 
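
The relative risks discussed above follow from the log-linear form of equation (4): a coefficient on a categorical covariate, once exponentiated, gives the hazard relative to the reference category. The Python sketch below illustrates this, together with one possible reading of the piecewise Weibull baseline as a linear spline in log duration with a knot at 12 months. The knot placement, the parameter names and all numerical values are assumptions made for illustration only.

```python
import math

KNOT_MONTHS = 12.0  # assumed knot separating durations below and above one year


def log_baseline(m, intercept, slope_short, slope_long):
    """Assumed form of g(m): a continuous linear spline in log duration."""
    log_m = math.log(m)
    log_knot = math.log(KNOT_MONTHS)
    return (intercept
            + slope_short * min(log_m, log_knot)
            + slope_long * max(log_m - log_knot, 0.0))


def hazard(m, x, beta, intercept, slope_short, slope_long):
    """Hazard at spell duration m (in months) for covariate vector x."""
    linear = sum(b * v for b, v in zip(beta, x))
    return math.exp(log_baseline(m, intercept, slope_short, slope_long) + linear)


def relative_risk(coefficient):
    """Risk relative to the reference category for a dummy covariate."""
    return math.exp(coefficient)
```

For instance, a coefficient of `math.log(2.5)` on a dummy for "spouse self-employed" would correspond to a hazard about 2.5 times that of the reference group "no spouse present", whatever the spell duration.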


4. FROM ESTIMATED PARAMETERS TO THE 
SIMULATION RESULTS: AN ILLUSTRATION 


Our example of the role of spouse’s employment status 
points to the need for family context in the simulation of 
employment activities. It is a challenge for LifePaths to
integrate these relationships into the simulation process. For
example, if individual education progression or the effects 
of education on employment transitions are not modeled 
appropriately and accurately, then the consequences will 
cascade from direct education-employment relationships to 
a chain of indirect impacts, involving relationships between 
education and marriage, fertility, interprovincial migration, 
etc. These impacts would then spill over to the simulated 
spouse, as indicated above. It is not difficult to see that, 
unless these relationships are specified appropriately and the 
parameters are estimated with reasonable accuracy, bias 
would be spread over a wide range of simulated outcomes. 

An overall validation of the LifePaths employment 
hazard equations was obtained by comparing simulated 
annual average employment/population ratios with direct 
cross-sectional estimates from the LFS. The simulated 
employment/population ratios were obtained from a syn- 
thetic population whose members were exposed appro- 
priately to one or other of the seven types of employment 
hazards over the course of each simulated year. The sim- 
ulated employment/population ratios were calculated from 
the resulting annual person-years of employment in the 
synthetic population: that is, these ratios are an outcome of 
simulated flows into and out of employment. The sim- 
ulations necessarily involved generating appropriate distri- 
butions of covariates that in turn determine the distributions 
of employment transition hazards. As may be seen in Figure 
8, LifePaths accurately reflects the age patterns of female 
employment in both 1976 and 2001 and correspondingly 
accounts for the dramatic change observed in those age 
patterns over the past quarter century. 
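
The calculation of a simulated employment/population ratio from person-years can be sketched as follows. This is a minimal Python illustration that assumes each simulated person contributes a pair (months employed, months present in the population) for a given calendar year. The function name and data layout are hypothetical and do not reproduce LifePaths output.

```python
def employment_population_ratio(person_months):
    """Ratio of person-years employed to person-years lived in one year.

    person_months: iterable of (months_employed, months_in_population)
    pairs, one pair per simulated person.
    """
    records = list(person_months)
    employed = sum(e for e, _ in records)
    present = sum(p for _, p in records)
    if present == 0:
        raise ValueError("no person-time observed in this year")
    # Months cancel in the ratio, so this equals person-years / person-years.
    return employed / present
```

Because the ratio is built from simulated spells, it summarizes the combined effect of all flows into and out of employment during the year rather than a status at a single reference week.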

Figure 8. Validating hazard equations using LifePaths


5. CONCLUSIONS


We have demonstrated that the LFS data — when 
organized into the fragmentary event histories collected over 
the six-month periods that most respondents spend in the 
sample — represents a significant longitudinal micro-data 
asset. There is sufficient sample and breadth of content to 
provide for important analysis of labour market dynamics 
and, conceivably, of demographic processes such as 
fertility. Moreover, the data is monthly and spans more than 
a quarter century, so that analysis based on it has 
uninterrupted time depth that is unique in Canada. 

In our main application (employment transitions), other 
results (not reported here) appear to confirm the influence of 
a range of explanatory variables on an individual’s chances 
of an employment transition. These covariates include age, 
job tenure (or duration not employed), educational attain- 
ment, presence of young children (especially for women), 
province of residence, seasonality, and business cycles. 
However, this work is still in its initial stage and, to date, our 
approach to inference has been informal. Future work will 
involve extending and refining our models and establishing 
a more rigorous basis for evaluation of the models.


Contact and Cooperation in the Belgian Fertility and Family Survey


MARC CALLENS and CHRISTOPHE CROUX


ABSTRACT 


Combining response data from the Belgian Fertility and Family Survey with individual level and municipality level data 
from the 1991 Census for both nonrespondents and respondents, multilevel logistic regression models for contact and 
cooperation propensity are estimated. The covariates introduced are a selection of indirect features, all out of the 
researchers’ direct control. Contrary to previous research, socio-economic status is found to be positively related to
cooperation. Another unexpected result is the absence of any considerable impact of ecological correlates such as urbanicity.


KEY WORDS: Nonresponse; Multilevel analysis; Fertility and Family Survey. 


1. INTRODUCTION 


The aim of this paper is to empirically assess the relative 
importance of correlates of contact and cooperation rates in 
the Belgian Fertility and Family Survey (FFS Belgium
1991).

The conceptual and theoretical nonresponse framework 
used in this paper has been proposed by Groves and Couper 
(G&C 1998). In their view, nonresponse arising from 
noncontact is directly influenced by survey design features 
such as the number and the timing of calls. Conditionally on 
these survey design features, other important features such 
as physical impediments of the housing units and 
accessible-at-home patterns of the would-be respondents, 
which are indirectly measured by various social environ- 
mental and socio-demographic attributes, also play an 
important role. The decision to cooperate or to refuse is 
primarily regarded as a direct function of a dynamic social 
communicative process between the interviewer and the 
interviewee. Survey design, main interviewer, sample 
person and social environment characteristics are consi- 
dered to have only an indirect influence on cooperation 
rates. 

We use both individual level and municipality level data 
from the 1991 Census, matched to the fieldwork
outcome variable for nonrespondents and respondents of
the 1991 Belgian FFS. In this survey, individuals are the
sampling units. It is a face-to-face survey with low 
noncontact (4%) and moderate refusal rates (22%). We 
consider our data to be hierarchically nested with sample 
units at the lower and municipalities at the higher level. 
Including covariates at both levels, multilevel logistic 
regression models for contact and cooperation propensity
are estimated. The covariates are a selection of indirect
features, all outside the researchers’ direct control.

Some intriguing results are: (1) Socio Economic Status 
indicators like education are positively related to coop- 
eration and (2) ecological factors including urbanicity are 
not correlated with nonresponse. This is in contrast with 
findings from previous US-based research. 


2. A THEORY FOR CONTACTABILITY 
AND COOPERATION 


The process of realising an interview consists of two 
major components: the process of contacting a sample 
person and, conditional on contact, the process of cooperation
with a survey request. An attractive multi-level
theoretical framework for studying contactability and 
cooperation has been proposed by Groves and Couper 
(G&C 1998). 


2.1 Contactability 


Chronologically, the process of contacting a sample 
person comes first. Some sample persons are never 
contacted by interviewers and hence never make a decision 
about their survey cooperation. Relative to the process of 
cooperation, the process of contacting a sample person is 
quite simple. 

G&C (1998) consider contactability to be a function of 
three factors: (1) whether there are any physical
impediments that prevent interviewers from getting in touch with
the sample person, (2) when sample persons are at home
and (3) when and how many times the interviewer tries to
contact the sample person. The number and timing of calls
by the interviewer and the accessible-at-home patterns of
the sample persons are the proximate causes of contactability.
The accessible-at-home patterns of the sample person
are affected by the presence of physical impediments (e.g., 
telephone presence), socio-demographic attributes (e.g., 
commuting times) and social environmental attributes (e.g., 
crime). Also survey design features such as the length of the 
data collection period and the interviewer workload might 
have an influence on contact rates. 


2.2 Cooperation 


The central question in the survey stage following 
contact is why sample persons do or do not cooperate with 
the interviewer request. In the Groves-Couper model to 
study cooperation, the proximate causes of the decision to 
cooperate or to refuse lie at the level of the householder and 
his or her interaction with the interviewer. Another 
component in the theoretical framework of G&C (1998) is 
the set of survey design features, such as: the agency of data 
collection, advance warning of the survey request, topic 
saliency, etc. 

G&C (1998) also consider two factors that are outside the
control of the survey designer: influences of the sample 
person and social environmental influences. These variables 
are not considered to be direct causal influences on 
cooperation, but indirect measures of what are essentially 
social psychological constructs. Important theoretical 
constructs in this respect are: opportunity costs, social 
exchange and social isolation. 


2.2.1 Opportunity Costs 


The notion of opportunity costs implies that sample 
persons weigh the opportunity costs in agreeing to spend 
their time responding to a survey interview. An important 
ingredient in the opportunity costs theory is the amount of 
discretionary time for the sample person available to 
complete the survey. Those with less discretionary time are 
less likely to feel free to participate in a survey. Some 
indirect indicators for the amount of discretionary time are: 
the inverse of the number of adults in a household and the
amount of labour force participation. Of course, there are
also obligations away from employment tasks such as 
commitments to friends and relatives that also might raise 
the opportunity costs of a survey. 


2.2.2 Social Exchange 

Social exchange theory considers the perceived value of 
equity of long-term associations between persons or 
between a person and societal institutions (Blau 1964). 
Central to all conceptualisations of social exchange is the 
notion that, unlike economic exchange, all social commod- 
ities are part of an intuitive bookkeeping system in which 
debts (e.g., obligations) and credits (e.g., expectations) are
taken into account (G&C 1998). The social exchange
perspective can be applied whenever there is an ongoing 
relationship between the survey organisation and the sample 
person (e.g., government surveys). 

Those receiving fewer services from the government 
may — in considering the cumulative effect of multiple 
government contacts — feel less need to cooperate. Since 
government services are disproportional across socio- 
economic strata, indicators of Socio-Economic Status (SES) 
should reflect exchange influences on survey participation. 
However, a major problem with social exchange theory is 
that two alternative hypotheses about the relationship between SES and
cooperation might be deduced from it (G&C 1998). First,
one can argue that lower SES groups may have the greatest 
indebtedness to the government for the public assistance 
they may receive. Higher SES groups feel far less that they 
owe any sort of repayment. In this perspective, the 
relationship between socio-economic status and cooperation 
propensity is a negative one. Alternatively, a curvilinear 
relationship between SES and cooperation may be 
hypothesised. The lowest SES groups may believe that they 
are disadvantaged routinely compared to more fortunate 
people. The highest SES groups feel themselves repeatedly 
targeted in terms of time and money but receive little in 
return. In such a hypothesis, both the highest and the lowest 
SES feel relatively deprived in the relationship with 
large-scale social institutions and tend to refuse survey 
cooperation. 


2.2.3 Social Isolation 


Closely related to the social exchange hypothesis is the 
social isolation hypothesis. Social isolates are out of touch 
with the mainstream culture of a society: they tend to 
behave in accordance with subcultural norms or in explicit 
rejection of those of the dominant culture. They are 
believed to be less likely to participate in a variety of social 
and political activities, including responding to surveys 
(Couper, Singer and Kulka 1997). In terms of SES, social 
isolation theory implies a positive relationship between SES 
and cooperation: lower SES groups are resentful of their 
dependence on the government, whereas higher SES groups 
have a greater sense of civic obligation. Such a positive
relationship between SES and cooperation is opposite to
the relationships predicted by social exchange theory.

Demographic indicators of social isolation are race, 
ethnicity, age and gender; with minorities, elderly and men 
in the role of the relatively isolated. Indicators of social 
isolation at the micro-level include whether the sample 
person lives in a single-person household, whether the 
sample person has any children, whether the sample person 
has moved recently and whether the sample person lives in 
a large multiunit structure.


2.2.4 Urbanicity


At the community level contextual factors such as 
urbanicity, population density, crime rates and lack of social 
cohesion are hypothesised to influence survey cooperation. 
Residents of rural areas tend to cooperate at a higher rate
than residents of towns. However, it is not clear
which mechanism is responsible for this urbanicity effect,
which might be explained in terms of the greater population
density, higher crime rates and greater social disorganisation
that are associated with life in urban areas. Population
density is hypothesised to reduce cooperation through the 
experience of crowding. Fear of crime may produce an 
unwillingness to provide information to strangers. Finally, 
urban life is associated with social disorganisation, charac- 
terised by weakened local kinship and friendship networks 
and reduced participation in local affairs. 


3. DATA AND METHOD 


3.1 Data 


In this study we make use of both aggregated and 
micro-level data of the Belgian 1991 Census linked to the 
response status for respondents and nonrespondents from 
the Belgian Fertility and Family Survey (FFS-Belgium 
1991) held shortly after the Census operations. 


3.1.1 The FFS Survey (1991) 


The Fertility and Family Survey in Belgium was 
organised by the Population and Family Study Centre 
(CBGS), a Scientific Institute from the Flemish Gov- 
ernment. This survey was carried out between April and 
October 1991, which is very close to the decennial census 
date: April 1 in the same year. The main focus of the 
FFS-project is on reproductive behaviour, to be seen 
however in the broader context of partnership and family 
history, and the interaction between employment and 
reproduction (Cliquet and Callens 1993; Callens 1995). The 
target population consists of men and women of Belgian 
nationality, born in the period 1951-1970 and with main 
residence in the Flemish Region of Belgium. 

A two-stage cluster sampling design was used for men 
and women separately. In a first stage, municipalities were 
selected from various socio-economic strata (Vanneste 
1989). In each selected municipality, individuals were 
selected at random. In this way 2,975 women and 1,989 
men were selected to take part in the survey. A fieldwork 
method was used to compensate for non-response: stratified 
random substitution of nonrespondents of the target sample 
by persons selected from a reserve sample (Chapman 1983; 
Vehovar 1999). 




The final sample size, i.e., including the substitution 
operation, equals 4,776 persons (2,897 women and 1,879 
men). In this study we make use of respondents and 
nonrespondent cases of both the initial target sample and 
the fieldwork substitution operation (N= 6,847). 

Among both men and women, the nonresponse can be 
ascribed in 7 out of 10 cases to a refusal to participate in the 
survey. In 2 out of 10 cases, nonresponse is due to the fact 
that the persons selected could not be contacted, and in 1 
out of 10 cases, an interview was impossible because of 
sickness, language difficulties or some other reason. 


3.1.2 Matching 1991 Census Person-Level Data 
(1991) 


Our primary source of information on both respondent 
and nonrespondent cases is provided by the 1991 Census. 

In an effort to reconcile privacy concerns and scientific 
interests, we used a simple technique to make the matching 
of person-level Census data and survey data anonymous. 
We provided a dataset to the National Institute of Statistics 
(NIS) containing only the national identification number 
and the response status for each respondent and non- 
respondent case. As a result of the matching operation by 
the NIS, we received a selection of the 1991 Census data 
enriched with only two survey variables: the response status 
variable and an indicator whether a sample person belongs 
to the base or substitute sample. 
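
The anonymous matching procedure described above can be sketched in Python. The field names (`national_id`, `response_status`, `substitute`) and the function itself are hypothetical: the sketch only illustrates the idea that the identifier is used as a join key by the statistical office and is not carried into the file returned to the researchers.

```python
def match_and_anonymise(survey_records, census_records):
    """Join survey outcomes to census data and drop the identifier.

    survey_records : list of dicts with keys 'national_id',
                     'response_status' and 'substitute'
    census_records : dict mapping national_id to a dict of census variables
    """
    matched = []
    for record in survey_records:
        census_row = census_records.get(record["national_id"])
        if census_row is None:
            continue  # no census match for this sample person
        row = dict(census_row)  # copy, so the census file is not modified
        row["response_status"] = record["response_status"]
        row["substitute"] = record["substitute"]
        matched.append(row)  # the national identifier is deliberately not copied
    return matched
```

The returned rows contain census variables plus the two survey variables only, which mirrors the description in the text.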

The 1991 Census individual level data we have at our 
disposal are: the individual form and the house unit form. 
The individual form contains information about: the place 
of residence, the nationality, the labour force activity status, 
the first marriage, the birth year of the children, education 
and professional activities. The house unit form includes 
information on the housing unit of the household such as: 
the type of housing unit, the number of housing units in the 
building, ownership, building period, the number of rooms 
and corresponding square metres, the presence of a
telephone and comfort indicators such as the number of
bathrooms.


3.1.3 Contactability and its Determinants 


To study the process of contactability, we ideally need 
data on the outcomes of all successive attempts to contact 
sample persons. In this study however, we do not have such 
detailed information at our disposal: we only know the final 
outcome of each survey request. Therefore, we can only 
study the probability of ever making contact with the 
sample person (coded 1 = contact and coded 0 = non- 
contact) and not whether it was easy or difficult to make 
contact. We do consider as contacted those sample persons who
are known to no longer reside at the sample address.
At 241 out of 6,847 sample units (3.52%), all contact
attempts failed.

The data we use are measured at two levels: the individual
level (n = 6,847) and the municipality level (n = 123).
At the sample person level, we consider three types of 
variables: physical impediments to contact sample persons, 
reasons for sample persons to be present in their homes and 
control variables. 

As there are no direct interviewer observations of 
physical impediments available to us, we have to rely solely 
on indicators for physical impediments available in the 
Census data. Three variables are used: whether the housing 
unit is a single-family structure or not, whether the housing 
unit is large (more than 10 units) or not and whether the 
sample person has a telephone or not. 

Determinants of at-home patterns in this study are: civil 
status (unmarried, married and divorced), age (20-24, 
25-29, 30-34 and 35-39 years) and activity status (inactive 
vs. other). For women only, we also consider the number of 
children (0, 1, 2 and 3+). For those in the labour force we 
have also detailed information about: working part-time vs. 
working full-time, the number of weekly working hours 
(<21, 21-35, 36-42, >42 hours), employment status 
(employee vs. own-account), having a second job or not and 
working at home or not. 

We also use two control variables: substitution (whether 
a sample person originates from the base target sample or 
from the substitution sample) and gender (whether a sample 
person comes from the female sample or from the male 
sample). 

At the municipality level (n = 123), we use five variables:
population density (persons per square km for the residence
of the sample person), urban status (the cities of Antwerp
and Gent vs. other municipalities), percentage multi-unit
structures (in quartile format: <7.13, 7.13-15.14, 15.14-27
and >27), percentage homes owner-occupied (in quartile
format: <64.5, 64.5-71, 71-77.7 and >77.7) and percentage
persons of minority race (in quartile format: <0.90,
0.90-2.22, 2.22-5.29 and >5.29).


3.1.4 Cooperation and its Determinants 


We are interested in the probability of ever getting 
cooperation (coded 1 = cooperation and coded 0 = non- 
cooperation) conditionally on contact; not whether it was 
easy or difficult to get cooperation from the sample person. 
For 1,399 out of 6,606 contacted sample persons (21.18%), 
all attempts to get cooperation failed. 
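
The two outcome rates quoted in sections 3.1.3 and 3.1.4 are linked by simple arithmetic, since cooperation is only defined for contacted sample persons. The short Python check below uses only the counts given in the text; the variable names are illustrative.

```python
SAMPLE_UNITS = 6847       # respondents and nonrespondents, base and substitute samples
NEVER_CONTACTED = 241     # all contact attempts failed
NOT_COOPERATING = 1399    # contacted, but all attempts to get cooperation failed

# Cooperation is conditional on contact, so its denominator excludes noncontacts.
contacted = SAMPLE_UNITS - NEVER_CONTACTED
noncontact_rate = 100.0 * NEVER_CONTACTED / SAMPLE_UNITS
noncooperation_rate = 100.0 * NOT_COOPERATING / contacted

print(f"contacted sample persons: {contacted}")
print(f"noncontact rate: {noncontact_rate:.2f}%")
print(f"noncooperation rate among contacted: {noncooperation_rate:.2f}%")
```

The check reproduces the 6,606 contacted sample persons and the rates of 3.52% and 21.18% reported above.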

Again, the data we use are measured at two levels: the 
individual level and the municipality level. At the sample 
person level, we have indicators for the opportunity costs 
hypothesis, the exchange hypothesis and the isolation 
hypothesis. Substitution is used as a control variable. 

Indicators for the opportunity costs hypothesis are: 
activity status (inactive vs. other), working part-time vs.
working full-time, the number of weekly working hours
(<21, 21-35, 36-42, >42 hours) and employment status 
(employee vs. own-account). 

Indicators for Socio-Economic Status in our study are: 
the surface area of the living rooms (in square metres: <65,
65-84, 85-104, 105-124 and >125), the number of bathrooms
(0, 1 and 2+) and educational level (primary; secondary,
first stage; secondary, second stage; higher education,
non-university; and higher education, university level).
Other exchange hypothesis indicators are: whether one receives
a replacement income from the government or not and whether the
house is owner-occupied or not. 

Indicators for the social isolation hypothesis are: gender, 
civil status (unmarried, married and divorced), age (20-24, 
25-29, 30-34 and 35-39 years), single-family structure of 
the housing unit and for women only: the number of 
children (0, 1, 2 and 3+) and the presence of children under 
the age of five years. Finally, substitution is included as a 
control variable. 

At the municipality level, we use the same five variables 
as in section 3.1.3: urban status, population density, 
percentage multiunit structures, percentage owner-occupied 
and percentage persons of minority race. 


3.2 Method of Analysis 
3.2.1 Bivariate $\chi^2$-Test


In a first exploratory series of analyses of the correlates 
of contactability and cooperation, we calculate percentages 
for two-way contingency tables and include the results for
the $\chi^2$-test of independence against association. Such a
$\chi^2$-test, like any significance test, indicates the degree of
evidence for the existence of an association, not the strength
of an association. When at least one variable is ordinal,
more powerful tests of independence than the $\chi^2$-test such
as the linear trend test do exist, but for reasons of simplicity 
of presentation, we do not use them in this paper. 
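
As an illustration of the test used in the bivariate tables, the following Python sketch computes the Pearson statistic for a two-way contingency table of counts. The function names are illustrative, and the closed-form p-value applies only to the case of one degree of freedom, where the statistic is the square of a standard normal variable.

```python
import math


def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


def p_value_df1(stat):
    """Upper-tail p-value for a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))
```

A call such as `chi_square_independence([[n11, n12], [n21, n22]])` takes the cell counts of a 2 x 2 table, for example contacted and never contacted by telephone presence. The tables in section 4 report percentages rather than counts, so the underlying counts would be needed to reproduce the published statistics.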


3.2.2 Multilevel Logistic Regression 


In a second series of analyses, we use multilevel logistic 
regression to simultaneously estimate the impact of the 
various determinants (Snijders and Bosker 1999). We opt
for the use of a multilevel method, because we regard our
data as hierarchically nested with individuals at the lower
level (level 1) and municipalities at the higher level (level
2).

Let $p_{ij}$ be the probability that an individual $i$ belonging
to municipality $j$ is contacted (or cooperates). We will
consider four different models for explaining this proba- 
bility: the null random model, two versions of the random 
intercept model and the standard logistic regression model. 

The empty or unconditional model does not take explanatory
variables into account. We specify the model such that
the logit-transformed probabilities $p_{ij}$ have a normal
distribution:


$$\operatorname{logit}(p_{ij}) = \log\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \gamma_0 + U_{0j},$$


where $\gamma_0$ is the population average and $U_{0j}$ the random
deviation from this average for group $j$. These deviations
$U_{0j}$ are assumed to be independent, normally distributed
random variables with mean zero and variance $\tau_0^2$.

When there are $r$ variables at the individual level that are
potentially explanatory for the observed outcomes, they
are incorporated as a linear function in the random intercept
model:


$$\operatorname{logit}(p_{ij}) = \gamma_0 + \sum_{h=1}^{r} \gamma_h x_{hij} + U_{0j},$$


where $\gamma_1, \ldots, \gamma_r$ are the slope parameters measuring the
effect of the explanatory variables.

If we drop the random effects $U_{0j}$, we obtain
a standard logistic regression model:


$$\operatorname{logit}(p_{ij}) = \gamma_0 + \sum_{h=1}^{r} \gamma_h x_{hij}.$$


By also including $s$ variables at the community level, we
get a random intercept model with both level-1 and level-2
covariates:


$$\operatorname{logit}(p_{ij}) = \gamma_0 + \sum_{h=1}^{r} \gamma_h x_{hij} + \sum_{k=1}^{s} \gamma_{r+k} z_{kj} + U_{0j}.$$


We use SAS Proc Nlmixed (SAS Institute 1999) to
estimate the parameters. SAS Proc Nlmixed uses an
adaptive version of Gauss-Hermite quadrature (numerical
integration) to solve the maximum likelihood
estimation problem. To test whether a specific parameter equals
zero, a likelihood ratio $\chi^2$-test is used.
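
To make the role of numerical integration concrete, the sketch below evaluates the marginal likelihood of a random intercept logistic model by integrating the random municipality effect out with Gauss-Hermite quadrature. It is a simplified Python illustration and not the SAS procedure: it uses a fixed three-point rule, whereas the adaptive version recentres and rescales the quadrature points and typically uses more of them. All function and argument names are assumptions.

```python
import math

# Three-point Gauss-Hermite rule:
#   integral of exp(-x^2) f(x) dx  is approximated by  sum_k w_k f(x_k).
GH_NODES = (-math.sqrt(1.5), 0.0, math.sqrt(1.5))
GH_WEIGHTS = (math.sqrt(math.pi) / 6.0,
              2.0 * math.sqrt(math.pi) / 3.0,
              math.sqrt(math.pi) / 6.0)


def inv_logit(eta):
    """Inverse of the logit link: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


def cluster_likelihood(y, x, gamma0, gammas, tau):
    """Marginal likelihood of the 0/1 outcomes observed in one municipality.

    y      : list of 0/1 outcomes (e.g., 1 = contact)
    x      : list of covariate lists, one per sample person
    gamma0 : intercept; gammas: slope parameters
    tau    : standard deviation of the random intercept
    """
    total = 0.0
    for node, weight in zip(GH_NODES, GH_WEIGHTS):
        u = math.sqrt(2.0) * tau * node  # change of variable for N(0, tau^2)
        lik = 1.0
        for yi, xi in zip(y, x):
            eta = gamma0 + sum(g * v for g, v in zip(gammas, xi)) + u
            p = inv_logit(eta)
            lik *= p if yi == 1 else 1.0 - p
        total += weight * lik
    return total / math.sqrt(math.pi)


def log_likelihood(clusters, gamma0, gammas, tau):
    """Sum of log marginal likelihoods over municipalities."""
    return sum(math.log(cluster_likelihood(y, x, gamma0, gammas, tau))
               for y, x in clusters)
```

With `tau` set to zero the random effect vanishes and the function reduces to the likelihood of the standard logistic regression model, which is the comparison underlying a likelihood ratio test of the random intercept variance.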


4. RESULTS 


4.1 Contactability 


Table 1 presents the bivariate $\chi^2$-test results for the
percentage never contacted by various indicators of
physical impediments. One strong correlate is whether the
housing unit is a single-family structure: units that are not
single-family structures have much higher noncontact rates
(8.1%) than single-family units (2.4%). Also, sample persons living in large multiunit
housing structures tend to have higher noncontact rates 
(11%) than those not living in large multiunit housing 
structures (3.1%). Another strong correlate is the presence 
of a telephone: 9.7% of those with no telephone were never 
contacted.


Table 1
Percentage Never-Contacted by ‘Physical Impediments’ Attributes

Physical impediments attributes    Percentage never contacted    $\chi^2$       df            p
Single-family structure                                          [illegible]    [illegible]   <0.0001
   No                              8.1
   Yes                             2.4
Large multi-unit structure (>10)                                 38.4           1             <0.0001
   No                              3.1
   Yes                             11.0
Telephone                                                        88.9           1             <0.0001
   No                              9.7
   Yes                             [illegible]


Table 2 shows the bivariate results for contactability by 
‘reasons to be present at home’ attributes. Relatively more 
unmarried (4.4%) and divorced (6.9%) sample persons than 
married (2.9%) sample persons are never contacted. There 
are much lower rates of noncontacts among those that are 
inactive (0.9%) compared to other persons (3.5%). Having
three or more children (0.9%) leads to low noncontact
rates, compared to having two children (2.6%) or at most one
child (4%). Those working at home (1.5%) and those who are
own-account workers (1.9%) show modestly lower
noncontact rates than those working elsewhere (3.6%) and
those working as employees (3.6%), respectively. Age, the
number of weekly working hours, working part-time vs. 
full-time and having a second job or not have no significant 
influence on contactability. 


Table 2
Percentage Never-Contacted by ‘Reasons to be Present at Home’ Attributes

Reasons to be present at home    Percentage        χ²           df           p
                                 never contacted
Civil status                                       19.4         2            <0.0001
   Unmarried                     4.4
   Married                       2.9
   Divorced                      6.9
Inactive vs. other                                 4.0          1            0.04
   Inactive                      0.9
   Other                         3.5
Number of children*                                [illegible]  3            0.0023
   0                             4.3
   1                             4.0
   2                             2.6
   3+                            0.9
Employment place†                                  [illegible]  1            0.03
   At home                       1.5
   Elsewhere                     3.6
Employment status†                                 [illegible]  [illegible]  0.05
   Employee                      3.6
   Own-account                   1.9

* subsample of women only (n=4,098)
† subsample of active persons only (n=5,368)

In addition, substitution is associated with higher
noncontact rates (5.9%) compared to the base sample
(2.6%). No significant difference has been found between the
male and the female subsamples.

In a multiple logistic regression model of the combined 
effects of those individual-level indicators that have some 
marginal bivariate effect on contactability, only single-
family structure (χ² = 35.75, p < 0.0001), telephone (χ² =
52.63, p < 0.0001) and substitution (χ² = 28.59, p <
0.0001) remain significant.

In Table 3, noncontact rates for various environmental 
attributes are presented. Cities (6.6%) have higher non- 
contact rates compared to nonurban areas (3.1%). The 
percentage never contacted is higher for high-density areas 
(5.4%) than low-density areas (1.7%). The presence of 
multiunit structures and the presence of persons of other 
nationalities tend to increase non-contact rates. Finally, the 
percentage of owner-occupied houses shows a negative 
association with noncontact rates. 


Table 3
Percentage Never-Contacted by ‘Environmental’ Attributes

Environmental attribute              Percentage        χ²     df   p
                                     never contacted
Urban status                                           24.0   1    <0.0001
   Cities                            6.6
   Other                             3.1
Population density                                     34.4   3    <0.0001
   Lowest quartile                   1.7
   Second quartile                   [illegible]
   Third quartile                    3.8
   Highest quartile                  5.4
% Multi-unit structures                                50.4   3    <0.0001
   Lowest quartile                   2.0
   Second quartile                   [illegible]
   Third quartile                    4.0
   Highest quartile                  [illegible]
% Persons of other nationalities                       23.1   3    <0.0001
   Lowest quartile                   [illegible]
   Second quartile                   2.3
   Third quartile                    4.3
   Highest quartile                  4.8
% Homes owner-occupied                                 64.4   3    <0.0001
   Lowest quartile                   6.4
   Second quartile                   3.6
   Third quartile                    1.6
   Highest quartile                  [illegible]


We now complement the bivariate analysis with a
multivariate analysis. Table 4 presents four models of
contact relative to noncontact. Model 1 is the
null random model at the municipality level. Model 2 is a
multiple logistic regression model. In this model, we have
included the person-level effects that remained significant
in a multivariate context (i.e., single-family structure, tele-
phone and substitution) and the variable activity status
because of its theoretical importance. Model 3 is a random
intercept version of Model 2. In Model 4, we have extended
Model 3 with the municipality-level variable multi-unit
structures (in %) only.


Table 4
Results of (Multilevel) Logistic Regression Models of Contactability

                              Model 1:      Model 2:      Model 3:      Model 4:
Results                       Null          Logistic      Random        Random
                              random        regression    intercept,    intercept,
                                                          level 1       level 1 & 2
Intercept                     [illegible]   [illegible]   [illegible]   [illegible]
                              (0.16)        (0.73)        (0.77)        (0.79)
Individual characteristics
   Single-family structure                  [illegible]   [illegible]   [illegible]
                                            (0.15)        (0.17)        (0.17)
   Telephone                                [illegible]   [illegible]   [illegible]
                                            (0.16)        (0.17)        (0.17)
   Inactive vs. other                       -1.23         -1.34         -1.33
                                            (0.72)        (0.75)        (0.74)
   Substitution sample                      [illegible]   [illegible]   [illegible]
                                            (0.14)        (0.15)        (0.15)
Municipality characteristics
   Multi-unit structures (%)                                            -0.02*
                                                                        (0.01)
Estimated variances
   Var(Intercept)             1.03                        0.82          0.79
Goodness of fit
   Deviance                   [illegible]   1,658         1,606         1,599

Notes: Standard errors in parentheses. * p<0.05, ** p<0.01, *** p<0.001, one-tailed tests.


The effects of the person-level covariates in Models 2, 3 
and 4 are in accordance with the findings of the bivariate 
analysis. Single-family structure and the presence of a 
telephone have a positive influence on contactability, while 
the effect of activity status is not significant. The impact of 
field substitution is negative. We also notice a (rather small) 
reduction of the regression coefficient for single-family 
structure and substitution in the multilevel models 3 and 4. 
Models 3 and 4 have one variance component for the
intercept. To test the null hypothesis that the random
intercept variance equals zero, we use the likelihood ratio
test and compare the conventional logit model (Model 2)
with the random intercept model (Model 3). The difference
in deviance between the two models is large (52 units). So, there
is variance in the intercept that municipality-level covariates
might explain. By introducing municipality
characteristics one at a time, we can test for significant
effects by calculating deviance differences between Model
4 and Model 3. The only deviance difference of importance
is that for the variable ‘multi-unit structures’ (7
units difference). No differences in deviance are found for
the introduction of the other level-two variables (urban
status, percentage owner-occupied, population density and
persons of other nationalities).
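
The deviance comparisons described above can be sketched as follows, using the deviances reported in Table 4. The sketch assumes each comparison adds one parameter and refers the deviance difference to a χ² distribution with 1 degree of freedom; for a variance component, whose null value lies on the boundary of the parameter space, this reference distribution is usually regarded as conservative. The function name is illustrative.

```python
import math


def lr_test_df1(deviance_restricted, deviance_full):
    """Return the deviance difference and its chi-square (df = 1) upper-tail p-value."""
    diff = deviance_restricted - deviance_full
    # For 1 degree of freedom, P(X > diff) = erfc(sqrt(diff / 2)).
    p_value = math.erfc(math.sqrt(diff / 2.0))
    return diff, p_value


# Deviances as reported in Table 4.
diff_23, p_23 = lr_test_df1(1658, 1606)  # Model 2 vs. Model 3 (random intercept)
diff_34, p_34 = lr_test_df1(1606, 1599)  # Model 3 vs. Model 4 (multi-unit structures)
print(diff_23, p_23)
print(diff_34, p_34)
```

Under these assumptions both differences exceed the conventional 5% critical value of 3.84, the first by a very wide margin.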

We consider Model 3 and Model 4 as the better models. 
According to these multilevel models, noncontact rates vary 
considerably across municipalities. However, the munici- 
pality level covariates in our study are not able to explain 
much of this variation. 


4.2 Cooperation 


In Table 5, we present the bivariate results for the 
opportunity costs hypothesis indicators. Being inactive or 
not does not seem to have an effect on the cooperation rate. 
However, when we use indicators of discretionary time, 
such as working part-time versus working full-time or the 
weekly number of working hours, the predicted negative 
relationship does show up in the bivariate results. In 
addition, self-employed sample persons have lower co- 
operation rates compared to employees. 


Table 5
Percentage Cooperation by ‘Opportunity Cost Hypothesis’ Indicators

Opportunity cost indicators     Percentage    χ²           df   p
                                cooperated
Inactive vs. other                            0.41         1    [illegible]
   Inactive                     77.0
   Other                        78.9
Part-time vs. full-time*                      10.04        1    0.001
   Part-time                    82.3
   Full-time                    77.4
Number of working hours*                      15.33        3    0.0016
   <20                          80.1
   21-35                        84.7
   36-42                        77.6
   >43                          [illegible]
Employment status*                            [illegible]  1    0.04
   Employee                     78.7
   Own-account                  74.6

* subsample of active persons only (n=5,180)


The predictions of the exchange hypothesis do not
show up in the bivariate results presented in Table 6. SES
indicators like the surface of the living room and the
number of bathrooms are not negatively, but positively
related to cooperation. Of course, these measures are not
ideal, because we are not able to control for household size.
Another indication of a positive relationship between
cooperation and SES is the case of educational level.
Whether one receives a replacement income or not and
whether the house is owner-occupied or not have no impact
on cooperation rates.

In a multiple logistic regression model of the combined 
effects of those social exchange indicators that have some 
marginal bivariate effect on cooperation, only the effects of
educational level (χ² = 39.35, df = 4, p < 0.0001) and surface
of the living room (χ² = 13.4, df = 4, p = 0.0095) remain
significant.


Table 6
Percentage Cooperation by ‘Exchange Hypothesis’ Indicators

Exchange indicators           Percentage    χ²     df   p
                              cooperated
Surface living rooms (m²)                   26.8   4    <0.0001
   <65                        74.8
   65 - 84                    77.6
   85 - 104                   78.6
   105 - 124                  [illegible]
   ≥125                       83.1
Number of bathrooms                         7.9    2    0.02
   0                          74.2
   1                          78.6
   2                          83.5
Educational level                           46.7   4    <0.0001
   Primary                    76.6
   Secondary, first stage     74.5
   Secondary, second stage    78.7
   High, non-university       85.1
   High, university           82.2
Replacement income                          0.3    1    [illegible]
   No                         78.7
   Yes                        79.5
Owner occupied                              3.4    1    0.06
   No                         77.4
   Yes                        79.4


In the analysis of the exchange hypothesis, we
found support for the notion that those with low SES
cooperate less with surveys than those in the high SES
groups. Such a positive relationship between SES and
cooperation is predicted by the social isolation hypothesis.
Demographic indicators of social isolation are gender, civil
status and age (see Table 7). No effects are found for gender, civil
status (however, divorced sample persons are probably less 
cooperative) and single-family structure. Age seems to have 
a negative effect on cooperation. For women only, we have 
also data on the presence of children. We find that the 
number of children has a positive effect on cooperation 
rates. The age of the children is also important: the presence 
of young children is associated with higher cooperation. 

The control variable substitution has a slightly negative
effect on cooperation (χ² = 4.24, p = 0.039), with lower
cooperation rates for the substitution sample (77.3%)
compared to the base sample (79.5%).


Table 7
Percentage Cooperation by ‘Social Isolation Hypothesis’ Indicators

Social isolation indicators     Percentage    χ²           df   p
                                cooperated
Gender                                        1.56         1    0.21
   Male                         78.1
   Female                       79.3
Civil status                                  [illegible]  2    [illegible]
   Unmarried                    79.8
   Married                      78.6
   Divorced                     75.4
Single-family structure                       0.76         1    0.38
   No                           78.9
   Yes                          [illegible]
Age                                           [illegible]  3    0.0006
   20 - 24                      80.8
   25 - 29                      80.7
   30 - 34                      78.3
   35 - 39                      73.5
Number of children*                           18.2         3    0.0004
   0                            71.9
   1                            76.3
   2                            81.7
   3+                           84.9
Presence of young children*                   13           1    0.0005
   No                           77.8
   Yes                          82.8

* subsample of women only (n=3,955)


Table 8 contains the bivariate results for social 
environmental differences in cooperation. Population 
density has a curvilinear effect on cooperation. Being a 
resident in a large metropolitan area has no effect. Thus, the
evidence for the claim in the literature that crowding and high
levels of stimulus input are negatively associated with
cooperation is mixed.

The effect of indicators for social cohesion is not clear. 
Only the variable percentage owner-occupied has a 
(curvilinear) effect. The variables percentage persons of 
other nationalities and percentage multi-unit structures 
seem to have no effect. 

Finally, we present in Table 9 a series of regression 
models for cooperation similar to those in section 4.1. In 
these models, we have included four individual level 
covariates: surface of the living room (<84, >84 m²),
education (up to secondary -second stage vs. high level), 
age (20-29, 30-39 years) and substitution sample. Surface 
of the living room and education have been selected as the 
only significant exchange hypothesis indicators in the 
previously described multiple logistic regression model. 


Age was the only significant effect in the bivariate analysis 
on the social isolation hypothesis. Finally, substitution is 
introduced to control for possible fieldwork effects. The 
slightly negative effect of substitution in Model 2 might 
indicate that fieldwork substitution negatively influences 
cooperation. However, this effect disappears completely 
when a random intercept is introduced (Models 3 and 4). 
The effects of the other individual level covariates are in 
accordance with the findings of the bivariate analysis and 
do not change across Models 2 to 4. SES indicators like 
education and surface of the living room have a positive 
effect and age has a negative effect on cooperation. These
effects support the social isolation hypothesis rather than
the exchange hypothesis.
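
The change in the substitution effect between Model 2 and Model 3 can be illustrated with a simple Wald-type calculation on the coefficients and standard errors in Table 9. This is a sketch under the assumption that the significance stars in the table correspond to one-tailed normal tests of coefficient over standard error; the text does not state how they were obtained, and the function name is hypothetical.

```python
import math


def wald_one_tailed(coef, se):
    """Return the z ratio and a one-tailed normal p-value in the direction of the estimate."""
    z = coef / se
    # Upper-tail probability of |z| under the standard normal distribution.
    p_value = 0.5 * math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Substitution sample coefficients and standard errors from Table 9.
z_model2, p_model2 = wald_one_tailed(-0.15, 0.07)  # conventional logit model
z_model3, p_model3 = wald_one_tailed(-0.03, 0.07)  # random intercept model

# Odds ratio implied by the Model 2 coefficient.
odds_ratio_model2 = math.exp(-0.15)
print(z_model2, p_model2, z_model3, p_model3, odds_ratio_model2)
```

Under this assumption the Model 2 coefficient is significant at the 5% level while the Model 3 coefficient is not, which matches the pattern of stars in the table.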


Table 8
Percentage Cooperation by ‘Environmental’ Attributes

Environmental attribute              Percentage    χ²           df   p
                                     cooperated
Urban status                                       0.84         1    0.36
   Cities                            80.1
   Other                             78.7
Population density                                 10.7         3    0.014
   Lowest quartile                   80.0
   Second quartile                   79.9
   Third quartile                    76.0
   Highest quartile                  79.4
% Multi-unit structures                            [illegible]  3    [illegible]
   Lowest quartile                   80.1
   Second quartile                   [illegible]
   Third quartile                    77.9
   Highest quartile                  78.1
% Homes owner-occupied                             12.3         3    0.0063
   Lowest quartile                   [illegible]
   Second quartile                   76.2
   Third quartile                    78.5
   Highest quartile                  80.9
% Persons of other nationalities                   [illegible]  3    [illegible]
   Lowest quartile                   [illegible]
   Second quartile                   77.6
   Third quartile                    79.6
   Highest quartile                  80.2


The only level-two variable of (modest) importance is
multi-unit structures (in %), which has been kept in Model 4.
The likelihood ratio test for introducing this variable gives
a difference of two units in deviance terms. The intro-
duction of one or more other second-level variables gives
likelihood ratio test differences close to zero in deviance
terms. We consider Models 3 and 4 the most suitable
models. The difference in deviance terms between Model 3
and Model 2 is 8 units, which is significant. The variance
for the intercept term is moderate (0.21). The introduction
of second-level covariates (including multi-unit structures)
leaves this variance term practically unchanged. Therefore,
we may state that environmental attributes like urbanicity
are not important for explaining cooperation.


Table 9
Results of (Multilevel) Logistic Regression Models of Cooperation

                              Model 1:      Model 2:      Model 3:      Model 4:
Results                       Null          Logistic      Random        Random
                              random        regression    intercept,    intercepts,
                                                          level 1       level 1 & 2
Intercept                     [illegible]   [illegible]   [illegible]   [illegible]
                              (0.06)        (0.06)        (0.08)        (0.10)
Individual characteristics
   Substitution sample                      -0.15*        -0.03         -0.02
                                            (0.07)        (0.07)        (0.07)
   Surface living rooms                     [illegible]   0.24***       0.24***
                                            (0.06)        (0.06)        (0.06)
   Educational level                        [illegible]   [illegible]   [illegible]
                                            (0.08)        (0.08)        (0.08)
   Age                                      [illegible]   [illegible]   [illegible]
                                            (0.06)        (0.06)        (0.06)
Municipality characteristics
   Multi-unit structures (%)                                            -0.006
                                                                        (0.004)
Estimated variances
   Var(Intercept)             0.21                        0.21          [illegible]
Goodness of fit
   Deviance                   6,664         6,664         6,596         6,594

Notes: Standard errors in parentheses. * p<0.05, ** p<0.01, *** p<0.001, one-tailed tests.


5. DISCUSSION 


In this paper, we have used 1991 individual and munici- 
pality level Census data matched to the response status 
variable of the Belgian Fertility and Family Survey to 
analyse the relative importance of correlates of contact and 
cooperation. 

We have organised our analysis according to the
Groves-Couper conceptual framework. In the bivariate
analysis stage, we have found essentially the same kind of
correlates as were predicted and actually found in a
US-based multi-survey analysis (G&C 1998). One
important difference between the present study and the
US results seems to be the nature of the effect of SES
indicators (e.g., education) on cooperation. In the present
study, we find a positive relationship; in the US study the
inverse relationship is found. We can imagine two
alternative explanations for these conflicting findings. The
first is based on survey design effects such as topic
saliency. The FFS survey in Belgium might be atypical in
being disproportionately attractive to the higher educated
because of the specific content of the survey. Replicating
the present analysis for surveys on varying topics can
easily test such a hypothesis. Another possible hypothesis
is that the effects of education on survey cooperation vary
across societies. The challenge then is to find out why this
relationship varies across countries. Such a hypothesis is far
less easy to test in practice, as data for several countries are
needed.

In the multilevel logistic regression analysis stage, the 
impact of all but one contextual factor completely vanished. 
Only the impact of the variable percentage of multi-unit 
structures shows, however only weakly, some resistance 
against ecological randomness present in the random 
intercept models. To us, this is a very intriguing result. 
Random ecological variation at the municipality level
seems largely to dominate even the urban-rural dichotomy.
A possible explanation is that the variation at the
community level is dominated by interviewer effects, not by
ecological factors.


In This Issue


This issue of Survey Methodology opens with a discussed paper by Paul Biemer. He provides 
evidence of reduced accuracy due to the redesign of employment questions in the Current Population 
Survey (CPS). This is an extension of the previous study by Biemer and Bushery (2000). In the 
current paper, the author attempts to trace the source of the error through extended analysis of the CPS 
data before and after the redesign. A new approach, using Markov Latent Class Analysis, is presented. 
This work aims at providing guidance for further investigation into the root causes of the errors in the 
collection of labour force data in the CPS. Discussions of this paper are provided by Jeroen Vermunt, 
Stephen Miller and Anne Polivka, and Clyde Tucker. 

In their paper, Gunning and Horgan propose a new algorithm for the construction of stratum 
boundaries in skewed populations. Their algorithm uses an auxiliary variable and achieves equal 
coefficients of variation for this auxiliary variable in each stratum. The method is based on the 
assumption that the auxiliary variable is uniformly distributed. One advantage of the method is that it 
is very easy to apply in practice. In an empirical study, the authors show that the proposed algorithm 
compares favourably with the cumulative root frequency method of Dalenius and Hodges (1957) and
with the Lavallée and Hidiroglou (1988) algorithm.

Hedlin and Wang consider the problem of bias coming from feeding back information from sample 
surveys to frames. They investigate the bias incurred by updating deaths on a frame that is used for 
future occasions of the same survey. They quantify this bias and develop an unbiased estimator for 
this situation. The theoretical results presented in the paper are illustrated through a simulation study. 

In their paper, Mudryk and Xie present the Quality Assurance (QA) and Quality Control (QC) 
aspects of the Intelligent Character Recognition operation of the 2001 Canadian Census of 
Agriculture. They show how an effective QA and QC plan was developed to ensure the highest quality 
data from the data capture operation of the Census. Results from an analysis of the Average Outgoing 
Quality of the data indicate the importance of a QA/QC plan. 

In Park and Lee, the design effects for the weighted mean and total estimators are investigated for 
complex surveys. In particular, they decompose the design effect for the weighted mean and total 
estimators under a two-stage design. Given this decomposition, they illustrate several common 
misconceptions about the design effects for the weighted mean and total estimators through several 
examples using commonly used designs. 

In their paper, Beaumont and Alavi investigate a robust generalized regression estimator. They look 
at alternatives to the optimal Best Linear Unbiased (BLU) estimator that are robust to design 
ignorability and/or model misspecification. In the situation where the design ignorability assumption 
may not hold, they propose a least squares estimator that is obtained by shrinking the design weights 
to their mean. To deal with model misspecification, they propose a weighted generalized M-estimator 
to reduce the influence of units with large weighted population residuals. Their theoretical results are 
illustrated with a simulation study. 

Zheng and Little propose a non-parametric model-based alternative to Horvitz-Thompson
estimation of a total in the case of two-stage sampling with pps sampling at the first stage. This is an
extension of their earlier work in which an outcome variable y_i is modeled as a smooth function of
the inclusion probability π_i. They show how to fit the model and estimate the total using a penalized
spline, and also develop alternative variance estimation procedures. Simulations are used to compare 
the proposed method to the Horvitz-Thompson estimator and to a model-assisted estimator. 




Liang and Kuk consider an alternative to the standard approach for regression estimation in a finite 
population. Instead of the usual linear model they use an arbitrary smooth function to allow for a non- 
linear regression, and then they apply Bayesian neural networks to the problem. The advantage of the 
neural network approach is that the problem of model misspecification is avoided. Liang and Kuk 
place a prior on each network connection instead of on the number of hidden units as is usually done. 
This permits a unified approach to the selection of the network structure and the selection of the 
auxiliary variables. Finally, they handle outliers by introducing a heavy tail distribution to model the 
disturbances of the data. 

In the last paper of this issue, Reiter uses multiple imputation to handle simultaneously both 
missing data and disclosure limitation. The basic idea is to fill in the missing data first to generate m 
completed datasets and then replace sensitive or identifying values in each completed dataset with r 
imputed values. Then, the author develops new combining rules for obtaining valid inferences from 
such multiply-imputed datasets. These rules take into account both sources of variability in the point 
estimators. 

Finally, the Editorial Board met this past summer at the Joint Statistical Meetings in Toronto. A 
suggestion was made at that meeting to have a Short Communications section in the journal. These 
would be shorter papers, typically around four Survey Methodology pages. Possible topics of short 
communications would include presentation of new ideas without the full development of a regular 
paper, brief reports of empirical work, and discussions or supplements to other papers published in the 
journal. All short communications would be refereed, although the reviewing process may be 
streamlined. I hope that this new format will be attractive to many authors, and look forward to 
receiving your submissions. 


M.P. Singh


An Analysis of Classification Error for the
Revised Current Population Survey Employment Questions


PAUL P. BIEMER


ABSTRACT 


The reduced accuracy of the revised classification of unemployed persons in the Current Population Survey (CPS) was 
documented in Biemer and Bushery (2000). In this paper, we provide additional evidence of this anomaly and attempt to 
trace the source of the error through extended analysis of the CPS data before and after the redesign. The paper presents a
novel approach to decomposing the error in a complex classification process, such as the CPS labor force status classification,
using Markov Latent Class Analysis (MLCA). To identify the cause of the apparent reduction in unemployed classification
accuracy, we identify the key question components that determine the classifications and estimate the contribution of each
of these question components to the total error in the classification process. This work provides guidance for further
investigation into the root causes of the errors in the collection of labor force data in the CPS, possibly through cognitive
laboratory and/or field experiments.


KEY WORDS: Survey redesign; Measurement error; Latent class analysis; Unemployment rate; Specification error. 


1. INTRODUCTION 


The Current Population Survey (CPS) is a monthly 
survey of approximately 60,000 households conducted by 
the U.S. Bureau of the Census for the Bureau of Labor 
Statistics (BLS). The primary purpose of the survey is to 
provide estimates of employment, unemployment, and other 
characteristics of the general U.S. labor force population. 
Estimates of the size, composition, and dynamic charac- 
teristics of the labor force are published each month by BLS 
and comprise one of the Nation’s key economic indicators. 

In January 1994, a revised questionnaire was introduced 
in the CPS to address the recommendations by the Levitan 
Commission in the late 1970s to convert the mode of 
interview for the CPS from paper and pencil questionnaire 
to computer-assisted interviewing methods, to clarify some 
of the questions on employment, as well as for a number of 
other reasons described in Rothgeb (1994). The overall 
objective of the redesign was to improve the quality of the 
data collected in the CPS. The CPS questionnaire had 
remained essentially unchanged since the last major revision 
in 1967. 

The revised CPS questionnaire was introduced after 
considerable research and testing that began in the mid- 
1980s. The purpose of the testing was to evaluate the quality 
and operational feasibility of various redesign options 
including moving the CPS from a paper and pencil 
questionnaire format to computer assisted interviewing. 
During these years of testing, more than 100,000 persons 
were interviewed in the various studies that were conducted 
(Rothgeb 1994). The CPS redesign research program
culminated in a large national study (referred to in the 
literature as the CATI/CAPI Overlap or CCO Field Test) 
that was conducted in 1993. The key component of this test 
consisted of a computer assisted survey of approximately 
12,000 households implementing revised CPS interviewing 
procedures and the revised questionnaire. This survey, 
referred to in this report as the Parallel Survey, was 
conducted from July 1992 to December 1993 concurrently 
with the ongoing CPS survey which used the original 
questionnaire. This type of split panel design makes it 
possible to estimate the effect of the redesign changes on the 
CPS labor force estimates. 

A number of papers and reports were published 
documenting the findings from the CCO Field Test 
(Cohany, Polivka and Rothgeb 1994; Rothgeb 1994; 
Polivka 1994; Kostanich and Cahoon 1994; Miller 1994; 
Thompson 1994; Dippo, Polivka, Creighton, Kostanich and 
Rothgeb 1994). One key finding from this research was that 
the Parallel Survey unemployment rate and the labor force 
participation rate were higher than in the CPS. The higher 
unemployment and labor force participation rates associated 
with the revised questionnaire were explained primarily by 
changes in the definition of employment. The revised 
questionnaire has a broader approach to both work and job 
search activities, which would tend to classify more persons 
as “in the labor force” and, thus, more persons who are not 
working as unemployed rather than out of the labor force 
(see, for example, Polivka 1994 and Rothgeb 1994). 

The increase in the unemployment rate due to the new
design was originally estimated at about one-half percentage
point. However, further analysis of the Parallel Survey data
called that estimate into question, and subsequently a report
was released estimating the increase to be less than one-tenth
percentage point (Polivka and Miller 1994). The concerns
raised in the subsequent reports regarding the utility of the
Parallel Survey data for assessing the effect of the redesign
are discussed further below and will be considered in our
analysis of these data.

An independent analysis conducted by Biemer and 
Bushery (2000) revealed an anomaly in the revised CPS 
labor force data that had not been detected by any of the 
previous research on the CPS redesign. Using a Markov 
latent class analysis (MLCA) approach, Biemer and 
Bushery compared the accuracy of labor force classifi-
cations under the original and revised designs by estimating
and comparing the error rates using the 1993 CPS data and
the 1995 and 1996 CPS data. They defined labor force
classification accuracy as the probability that a person who
is truly in some labor force category, say category a, is
classified as being in a by the CPS; i.e., Pr(classified in a |
truly in a). For example, the classification accuracy for
unemployment is the probability a person who is truly 
unemployed, according to the CPS definition, is correctly 
classified as unemployed by the CPS classification rules. 
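
To make the definition concrete, the sketch below computes Pr(classified in a | truly in a) from a cross-classification of true by observed status. The counts are invented for illustration; in practice the true status is not observed, which is why Biemer and Bushery estimate these probabilities with MLCA rather than by direct tabulation.

```python
# Hypothetical counts: outer key = true status, inner key = CPS classification.
counts = {
    "employed": {"employed": 980, "unemployed": 5, "not in labor force": 15},
    "unemployed": {"employed": 6, "unemployed": 82, "not in labor force": 12},
    "not in labor force": {"employed": 10, "unemployed": 8, "not in labor force": 482},
}


def classification_accuracy(counts, category):
    """Estimate Pr(classified in category | truly in category) from a count table."""
    row = counts[category]
    return row[category] / sum(row.values())


accuracy = {a: classification_accuracy(counts, a) for a in counts}
for category, value in accuracy.items():
    print(f"{category}: {value:.3f}")
```

In this made-up table, 82 of the 100 truly unemployed persons are classified as unemployed, giving an accuracy of 0.82 for that category.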

In Table 2 of their paper, Biemer and Bushery report that 
the classification accuracy for unemployment dropped by 
5.7 percentage points, from approximately 81.8 percent 
(s.e. = 0.90) in 1993 to 76.1 (s.e. = 1.2) in 1995 and 74.4 
percent (s.e. = 1.2) in 1996. These results suggest that the 
redesigned CPS misclassifies the true unemployed at a 
higher rate than the old CPS design. The authors first 
considered that this result could be an artifact of the MLCA 
methodology. As shown below, MLCA does not require a
true or “gold standard” measurement of employment to
estimate classification error. Rather, the method relies on a
model describing the true month-to-month changes in
employment status as well as the process of
classifying individuals into labor force categories. It is
possible that labor force transitions that deviate from the 
model specification could be regarded as misclassifications 
in the estimation process. 
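
The size of the drop can be put in perspective with a back-of-the-envelope calculation from the reported estimates and standard errors. The sketch assumes the yearly estimates are independent, which the text does not state, so the z ratios are indicative only; the function name is illustrative.

```python
import math


def difference_z(est1, se1, est2, se2):
    """Return the difference of two estimates and its z ratio, assuming independence."""
    diff = est1 - est2
    se_diff = math.sqrt(se1 ** 2 + se2 ** 2)
    return diff, diff / se_diff


# Unemployment classification accuracy (percent) and standard errors as reported.
drop_1995, z_1995 = difference_z(81.8, 0.90, 76.1, 1.2)  # 1993 vs. 1995
drop_1996, z_1996 = difference_z(81.8, 0.90, 74.4, 1.2)  # 1993 vs. 1996
print(f"1995: drop = {drop_1995:.1f} points, z = {z_1995:.2f}")
print(f"1996: drop = {drop_1996:.1f} points, z = {z_1996:.2f}")
```

Under the independence assumption the standard error of each difference is 1.5 percentage points, so both drops are several standard errors away from zero.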

To check the validity of the MLCA results, the authors 
conducted a series of analyses using traditional estimation 
approaches, analysis of the error by population groups, 
comparisons of the error estimates to other published esti- 
mates, and simulations to assess the effect of model failure 
on the results. As an example, there is evidence that the test- 
retest reliability of the unemployment category decreased 
after the redesign. Prior to the redesign, the index of
inconsistency (a measure of unreliability traditionally used
at the Census Bureau, equal to 1 − κ, where κ is Cohen’s
kappa coefficient (Cohen 1960)) for the unemployed labor
category averaged 30 percent for the period 1992-1993.
Following the redesign,
the index of inconsistency increased to almost 40 percent for 
the period 1995-1996. These analyses support their claim 
that the accuracy of the CPS methodology for classifying 
unemployed persons declined after the redesign. 
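
For readers unfamiliar with the index of inconsistency, the sketch below computes it as 1 − κ from an interview-by-reinterview table. The counts are hypothetical and the function names are illustrative; this is not the Census Bureau's production procedure.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square interview-by-reinterview table of counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    observed = sum(table[i][i] for i in range(k)) / n
    row_totals = [sum(table[i]) for i in range(k)]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]
    # Agreement expected by chance from the marginal distributions.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return (observed - expected) / (1.0 - expected)


def index_of_inconsistency(table):
    """Index of inconsistency, defined here as 1 - kappa."""
    return 1.0 - cohens_kappa(table)


# Hypothetical test-retest counts.
# Rows: original interview (unemployed, not unemployed); columns: reinterview.
retest = [[40, 20], [20, 920]]
index = index_of_inconsistency(retest)
print(f"index of inconsistency = {index:.3f}")
```

With these invented counts the index is about 0.35, of the same order as the 30 to 40 percent range cited in the text.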

In their discussion of the results, the authors speculated 
that the drop in classification accuracy could indicate a 
problem with the revised unemployment questions. That is, 
the revised unemployment questions may be subject to 
greater classification error and, thus, less classification 
accuracy. Another possibility they considered is change in 
the characteristics of the unemployed populations from 
1993 to 1995 and 1996. Since the unemployment rate 
dropped from 1993 to 1996, it is possible that persons who 
would be more accurately classified by the CPS system left 
the ranks of the unemployed, leaving persons who would be 
less accurately classified in the category. This hypothesis 
could be tested by estimating the accuracy rates for the two 
methodologies for the same time period. The Parallel 
Survey offers a means to conduct such an analysis. 

The current paper continues the investigation of the 
reduction in MLCA unemployment classification accuracy 
rates observed by Biemer and Bushery. The current analysis 
uses MLCA models very similar to those used by Biemer 
and Bushery for estimating the classification accuracy for 
the original and revised versions of the CPS questionnaire. 
However, the time period considered here is expanded to
include the 15 months prior to and following the introduction
of the revised questionnaire: a total of 30 contiguous
months. In addition, data from the Parallel Survey for the
period January 1993 through December 1993 are used to
compare the employment classification accuracy of the original
and revised questionnaires for the same time period.

Our analysis focuses on a labor force classification 
variable that is derived from a number of questions on the 
employment section of the CPS questionnaire. This variable 
is often referred to as a “recoded” labor force variable since 
it is determined by mapping a pattern of CPS responses to 
questions about employment onto particular labor force 
categories such as employed — at work, employed — not at 
work, unemployed — looking for work, and so on. Biemer 
and Bushery used a three-category employment classi- 
fication variable: employed (EMP), unemployed (UEM), 
and not in the labor force (NLF). For the present analysis, a 
four-category variable is used that subdivides the UEM 
category into unemployed-on layoff (UEM-LAYOFF) and 
unemployed-looking for work (UEM-LOOKING). This is 
done as a first step toward isolating the source of the 
apparent inaccuracy in unemployment classification. How- 
ever, further decomposition of these categories will be 
necessary to arrive at the root source of the error as will be 
shown subsequently. 

In section 2 we describe the CPS labor force concepts 
that are most relevant to our study and the structure of the 
data sets in the analysis. In section 3 we review the MLCA 
estimation methodology and models used by Biemer and 
Bushery in their analysis and describe the application of 
their methodology for the present purposes. In section 4 we 
present the results of our analysis and what they suggest 
regarding the source of the classification error in the new 
questionnaire. Finally, section 5 provides a summary of the 
key findings and our conclusions from the study. 


2. DATA AND CONCEPTS 


2.1 The Data Sets for Our Study 


Except for the Parallel Survey, the CPS data in our 
analysis were downloaded from the National Bureau of 
Economic Research (NBER) website (www.nber.org). This 
website contains microdata for the CPS for every month 
from January 1976 through December 2004. The MLCA 
approach was applied directly to these microdata without 
the need for supplementary data or data external to the CPS. 

In the preliminary analysis, we investigated the CPS 
classification accuracy for a six-year period: January 1992 
through December 1997. That analysis was aimed at 
determining whether the anomaly first noted in Biemer and 
Bushery (2000) is a transient phenomenon affecting only the 
months immediately following the introduction of the new 
questionnaire or whether it persisted for some years after the 
new questionnaire was introduced. If temporary or transient, 
the anomaly might be related to problems during the phase- 
in of the new design; for example, interviewer training or 
issues related to the startup of data collection. However, 
evidence of a persistent, continuing effect could suggest 
problems with the survey design; for example, the question- 
naire, interviewing procedures, or the recoding algorithm. 

By applying MLCA across all months from 1992 
through 1997 we determined that, although the magnitude 
of the reduction in accuracy varies somewhat from month to 
month, it does indeed persist for all months following the 
introduction of the revised questionnaire. The results 
confirmed Biemer and Bushery’s conjecture of a systemic 
effect possibly linked to the new unemployment questions 
introduced in January 1994. 

Due to space considerations, in this paper we present 
results from a somewhat shorter time frame than considered 
in the preliminary analysis, viz., the years 1992, 1993, 1994, 
and 1995. This time period covers two years of the CPS 
using the original questionnaire and two years using the 
revised questionnaire. In addition, we will also present some 
results from an MLCA of the 1993 Parallel Survey data that 
can be compared with results from the main CPS. 

The data sets in our study are quite large. Each estimate 
of classification error we obtain is based upon all house- 
holds that were interviewed in the CPS for three consecutive 
months. Across the four years in our analysis, the total 
number of households responding for all three months in 
any three-month period varies from about 37,000 to more 
than 40,000. For the 1993 Parallel Survey, the number of 
households satisfying this criterion is approximately 10,000. 
The estimates we produce are appropriately weighted for 
probabilities of selection and other post-survey adjustments 
and, therefore, reflect the response probabilities of the 
published CPS estimates. Weights were constructed by 
taking an average weight across the three consecutive 
months that were combined to form a longitudinal record 
for the analysis. (Unweighted analyses were also conducted, 
and the results were very similar to those of the weighted 
analysis, which suggests that the choice of weights has 
little effect on the study outcomes.) 

Because of a problem in the identification variables 
required for linking households for the months June 1995 
through December 1995, it was not possible to include these 
months in our analysis. Further, since our conclusions 
would not change by including data from the 1996 or later 
years of the CPS, we confine our analysis to 15 months 
prior and 15 months following the introduction of the 
revised questionnaire. Thus, for most of the analysis to 
follow, we will provide averages of estimates from August 
1992 through December 1993 for the original questionnaire 
and from January 1994 through May 1995 for the revised 
questionnaire (note that since our estimates are based upon a 
moving average of three consecutive months, seasonal 
variations in the labor rates and transition probabilities are 
accounted for in the estimates of classification error). 


2.2 Labor Force Concepts 


The revised CPS questionnaire was introduced in 1994 to 
improve the overall quality of labor market information 
through extensive question changes and through the use of 
computer technology in the data collection. In the following, 
we describe a few concepts that were affected by the 
questionnaire redesign and that are relevant for the current 
analysis. 


Employed. The labor force questions in the original 
questionnaire began with the question “What were you 
doing most of LAST WEEK (working, keeping house, 
going to school, or something else)?” Interviewers were 
allowed to modify the parenthetical part of this question 
according to the age of the respondent. In some cases, the 
word “work” or “working” was not part of the question. As 
an example, if the respondent appeared to be of student age, the 
interviewer was allowed to leave out the word “working.” 
The revised questionnaire replaced this question with two 
questions: “Does anyone in this household have a business 
or a farm?” and “LAST WEEK, did you do ANY work for 
(either) pay (or profit)?” where the parenthetical parts of the 
question are read if the response to the first 
question is “yes.” Further, additional questions were added 
to clarify whether earnings or profits were received from the 
family business or farm. Thus, the revised questionnaire 
concept of employment appears to be somewhat broader 
and better defined than the original questionnaire concept. 

Unemployed. The definition of unemployment was 
slightly modified in the revised questionnaire. In the original 
questionnaire, persons waiting for a new job to start were 
classified as unemployed. Under the revised questionnaire 
definition, a person is unemployed only if all of the 
following are true: (1) without a job, (2) actively seeking 
work or on layoff from a job and expecting recall within the 
next six months, and (3) currently available to take a job 
(except for a possible temporary illness). 

On Layoff. Persons on layoff are defined as persons 
separated from a job and who are awaiting a recall to return 
to that job. The original questionnaire did not consider or 
collect information on the expectation of recall. This was 
problematic because to most people, the term “layoff” could 
mean permanent termination from the job rather than the 
temporary loss of work economists are trying to measure. 

Job Search Methods. To be counted as unemployed and 
looking for work, a person must have engaged in an active 
job search during the four weeks prior to the survey. The 
revised questionnaire includes a somewhat broader question 
about job search methods with expanded and restructured 
response categories to allow interviewers to more easily 
record and distinguish between active and passive job 
search activities. In addition, it provides additional followup 
questions for those who respond “nothing” or “don’t know.” 

Reference Week. While the original questionnaire 
referred to LAST WEEK, the reference period was never 
explicitly defined. The revised questionnaire provides 
specific dates of the reference week. 

We will refer to these changes later in the report when 
we discuss the differences in the classification error and 
specification error between the revised and original 
questionnaires. 

As previously noted, Biemer and Bushery focused on a 
three-category labor force recoded variable with categories: 
employed (EMP), unemployed (UEM), and not in the labor 
force (NLF). For the present analysis, we used an expanded, 
seven-category recoded variable also available on the CPS 
public use data files. This variable divides the UEM category 
into two categories corresponding to persons on layoff 
(LAYOFF) and persons looking for work (LOOKING). The 
seven-category variable also divides the EMP and NLF 
categories into subcategories; however, this level of detail in 
the EMP and NLF categories is not needed in our analysis. 
Thus, the seven-category variable will be collapsed to a 
four-category variable corresponding to EMP, UEM-LAYOFF, 
UEM-LOOKING, and NLF. The correspondence between the 
three- and four-category variables is shown in Figure 1. 
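
The collapsing step can be sketched as a simple lookup. The fragment below is only an illustration: the numeric codes follow the new-questionnaire ordering shown in Figure 1, and the names `SEVEN_TO_FOUR`, `FOUR_TO_THREE` and `collapse` are hypothetical, not CPS variable names.

```python
# Hypothetical lookup tables; codes follow the new-questionnaire
# ordering of the seven-category recode in Figure 1.
SEVEN_TO_FOUR = {
    1: "EMP",          # employed, at work
    2: "EMP",          # employed, absent
    3: "UEM-LAYOFF",   # unemployed, on layoff
    4: "UEM-LOOKING",  # unemployed, looking
    5: "NLF",          # retired, not in labor force
    6: "NLF",          # disabled, not in labor force
    7: "NLF",          # other, not in labor force
}

# The three-category variable merges the two unemployed subcategories.
FOUR_TO_THREE = {
    "EMP": "EMP",
    "UEM-LAYOFF": "UEM",
    "UEM-LOOKING": "UEM",
    "NLF": "NLF",
}


def collapse(code7):
    """Return the (four-category, three-category) labels for a recode value."""
    four = SEVEN_TO_FOUR[code7]
    return four, FOUR_TO_THREE[four]
```

Under this sketch, `collapse(3)` gives the pair `("UEM-LAYOFF", "UEM")`. For data collected with the original questionnaire, the footnote to Figure 1 indicates that codes 3 and 4 would have to be swapped first.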


3. LATENT CLASS MODELS FOR CPS 
CLASSIFICATION ERROR 


Markov latent class models were first proposed by 
Wiggins (1973) and refined by Poulsen (1982). Van de Pol 
and de Leeuw (1986) established conditions under which 
the model is identifiable and gave other conditions of 
estimability of the model parameters. In this section we 
describe the basic model proposed by Biemer and Bushery 
(2000) and its extensions for application in the current 
analysis. 

Let the CPS target population be divided into L groups 
(such as age, race, or sex groups) and let the variable G be 
the label for group membership. For example, $G_i = 1$ if the 


Original Seven-Category Variable                                  | Four-Category     | Three-Category
Old Questionnaire                | New Questionnaire              | Analysis Variable | Analysis Variable
1. Working - at work             | 1. Employed - at work          | 1. EMP            | 1. EMP
2. With job - not at work        | 2. Employed - absent           | 1. EMP            | 1. EMP
3. Unemployed - on layoff¹       | 3. Unemployed - on layoff      | 2. UEM-LAYOFF     | 2. UEM
4. Unemployed - looking for work¹ | 4. Unemployed - looking       | 3. UEM-LOOKING    | 2. UEM
5. Working without pay (less     | 5. Retired - not in labor force | 4. NLF           | 3. NLF
   than 15 hours in a family     |
   farm or business) or          |
   temporarily absent from a     |
   job without pay               |
6. Unavailable to take a job if  | 6. Disabled - not in labor force | 4. NLF          | 3. NLF
   one had been offered          |
7. Not in the labor force        | 7. Other - not in labor force  | 4. NLF            | 3. NLF

¹ Note: In the original questionnaire, categories 3 and 4 are reversed compared to corresponding categories in the revised questionnaire. 

Figure 1. Association of the Seven-Category Employment Recode Variable with the Three- and Four-Category Variables Used in the Analysis 

population member is in group 1, $G_i = 2$ for group 2 and 
so on. Let $X_{gi}$, $Y_{gi}$, and $Z_{gi}$ denote the true labor force 
classifications for the $i$th person in group $G = g$ (for 
$g = 1, \ldots, L$ and $i$ indexing the persons in group $g$), 
where $X_{gi}$ is defined as 

$$
X_{gi} =
\begin{cases}
1 & \text{if person } (g,i) \text{ is employed in time period 1} \\
2 & \text{if person } (g,i) \text{ is unemployed - on layoff in time period 1} \\
3 & \text{if person } (g,i) \text{ is unemployed - looking in time period 1} \\
4 & \text{if person } (g,i) \text{ is not in the labor force in time period 1}
\end{cases}
$$

with analogous definitions for $Y_{gi}$ and $Z_{gi}$ for periods 2 
and 3, respectively. Consistent with the conventions of the 
LCA literature, we will drop the subscripts from the 
variables to simplify the notation. 

Let $\pi_{xyz|g}$ denote $\Pr(X = x, Y = y, Z = z \mid G = g)$, let 
$\pi_{x|g}$ denote $\Pr(X = x \mid G = g)$, let $\pi_{y|gx}$ denote 
$\Pr(Y = y \mid X = x, G = g)$, and let $\pi_{z|gxy}$ denote 
$\Pr(Z = z \mid Y = y, X = x, G = g)$. Then, the probability 
that an individual in group g has labor status x in period 
1, y in period 2, and z in period 3 is $\pi_{xyz|g}$, which may be 
written as 

$$\pi_{xyz|g} = \pi_{x|g}\, \pi_{y|gx}\, \pi_{z|gxy}. \qquad (1)$$

Finally, under the first-order Markov assumption, which is a 
necessary condition for model identifiability (see Van de 
Pol and de Leeuw 1986), we assume 

$$\pi_{z|gxy} = \pi_{z|gy}, \qquad (2)$$

i.e., at period 3, the true status of an individual does not 
depend on the period 1 status, once the period 2 status is 
known. An alternate interpretation is that the current status, 
given the prior period’s status, does not depend upon the 
prior period’s transition. 

Now, consider the observed labor force classifications 
from the CPS denoted by $A_{gi}$, $B_{gi}$, and $C_{gi}$ for periods 1, 
2, and 3, respectively, where 

$$
A_{gi} =
\begin{cases}
1 & \text{if person } (g,i) \text{ is classified as EMP in time period 1} \\
2 & \text{if person } (g,i) \text{ is classified as UEM-LAYOFF in time period 1} \\
3 & \text{if person } (g,i) \text{ is classified as UEM-LOOKING in time period 1} \\
4 & \text{if person } (g,i) \text{ is classified as NLF in time period 1}
\end{cases}
$$

with analogous definitions for the response indicators, $B_{gi}$ 
and $C_{gi}$, for periods 2 and 3, respectively. Using an 
extension of the notation established above, we denote the 
response probabilities in each of these classifications as 
$\pi_{a|gx} = \Pr(A = a \mid G = g, X = x)$, with analogous definitions for 
$\pi_{b|gy}$ and $\pi_{c|gz}$. Thus, $\pi_{a=1|g,x=2}$ is the probability that the 
CPS classifies a person in group g as employed (A = 1) 
when the true status is unemployed - on layoff (X = 2). 
Likewise, $\pi_{a=2|g,x=2}$ is the probability that the CPS correctly 
classifies a person in group g as unemployed - on layoff. 
Finally, we assume 

$$\pi_{abc|gxyz} = \pi_{a|gx}\, \pi_{b|gy}\, \pi_{c|gz}, \qquad (3)$$

i.e., that classification error in the observed labor force status 
is independent across the three months. 

The CPS labor force classifications for each month of a 
three consecutive month interval are the outcome variables 
in our analysis. Let A, B, and C denote the observed 
classifications and let X, Y, and Z denote the (unobserved) 
true classifications for Month 1, Month 2, and Month 3, 
respectively. Let G denote some grouping (or stratification) 
variable to be defined later in the analysis. Under these 
assumptions, we can write the probability for classifying a 
CPS sample member in cell (g, a, b, c) of the GABC table as 
follows: 


$$\pi_{gabc} = \sum_{x,y,z} \pi_{g}\, \pi_{x|g}\, \pi_{y|gx}\, \pi_{z|gy}\, \pi_{a|gx}\, \pi_{b|gy}\, \pi_{c|gz}. \qquad (4)$$


Extensions to more than one grouping variable are 
straightforward. 
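
As a rough illustration of how the cell probabilities in (4) are built up, the sketch below sums over all latent paths (x, y, z) for a single group. It is not the authors' software: the function name, the two-class example, and all numerical values are made up, the group term is omitted (so the result is conditional on the group), and the same response matrix is reused for all three months for brevity.

```python
from itertools import product


def cell_probabilities(init, trans12, trans23, resp):
    """Observed cell probabilities {(a, b, c): p} for one group.

    init[x]       = Pr(X = x)
    trans12[x][y] = Pr(Y = y | X = x)
    trans23[y][z] = Pr(Z = z | Y = y)  (first-order Markov)
    resp[x][a]    = Pr(A = a | X = x), reused for all three months
    """
    k = len(init)
    cells = {}
    for a, b, c in product(range(k), repeat=3):
        total = 0.0
        # Sum over every unobserved true path (x, y, z).
        for x, y, z in product(range(k), repeat=3):
            total += (init[x] * trans12[x][y] * trans23[y][z]
                      * resp[x][a] * resp[y][b] * resp[z][c])
        cells[(a, b, c)] = total
    return cells


# Made-up two-class example (0 = employed, 1 = unemployed).
init = [0.9, 0.1]
trans = [[0.95, 0.05], [0.30, 0.70]]
resp = [[0.98, 0.02], [0.20, 0.80]]
cells = cell_probabilities(init, trans, trans, resp)
```

Because each row of the input matrices sums to one, the resulting cell probabilities also sum to one. With an error-free response matrix the observed cells reduce to the latent path probabilities.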

Under multinomial sampling, the likelihood function for 
the GABC table is 


$$\Pr(GABC) = C \prod_{g,a,b,c} \pi_{gabc}^{\,n_{gabc}} \qquad (5)$$

where C is the multinomial constant, $n_{gabc}$ is the number of 
sample members observed in cell (g, a, b, c), and $\prod$ denotes the 
product of the terms over the subscripts g, a, b, and c. Under 
the assumptions made previously, the model parameters are 
estimable using maximum likelihood estimation methods. 
Van de Pol and de Leeuw (1986) provide the formulas for 
applying the E-M algorithm to estimate the parameters of 
this model and describe the conditions for their estimability. 
The ℓEM software (Vermunt 1997) was used to fit the 
MLCA models. 
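
The quantity maximized in the estimation step can be sketched as follows. This is a minimal illustration of the multinomial log-likelihood kernel only, not the E-M algorithm and not the software used for the analysis; the function name and the dictionary-based layout of counts and probabilities are assumptions.

```python
import math


def log_likelihood_kernel(counts, probs):
    """Sum of n * log(pi) over cells; the multinomial constant C is omitted.

    counts and probs are dictionaries keyed by the same cell labels.
    Cells with a zero count contribute nothing and are skipped.
    """
    return sum(n * math.log(probs[cell])
               for cell, n in counts.items() if n > 0)
```

For a saturated model the kernel is largest when the cell probabilities equal the observed cell proportions; a latent class model restricts the probabilities to the form implied by the model, and maximum likelihood picks the parameter values that bring the kernel as close to that upper bound as the restrictions allow.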

In their investigations of the validity of MLCA estimates 
for analyzing CPS labor force classification error, Biemer 
and Bushery analyzed CPS data collected during the first 
quarter of each of three years: 1993, 1995, and 1996. They 
also conducted several types of analysis using the CPS 
unreconciled reinterview data for the same time period. The 
reinterview analysis provided another approach for 
estimating CPS classification error as well as evidence of 
the validity of the MLCA approach. Their evaluation of 
MLCA validity considered five criteria: (1) model 
diagnostics, (2) model goodness of fit across years of CPS, (3) 
agreement between the model and test-retest estimates of 
response probabilities, (4) agreement between the model 
and test-retest estimates of inconsistency, and (5) 
plausibility of the patterns of classification error. The MLCA 
method performed well in all five tests. For example, the 
same model provided the best fit of the data for each year 
analyzed; there was good agreement between the latent class 
estimates of reliability and those derived from traditional 
test-retest methodology; and the estimated error rates were 
consistent with those of previous studies, e.g., Chua 
and Fuller 1987; Abowd and Zellner 1985; Poterba and 
Summers 1995; and Sinclair and Gastwirth 1998. 

Ostensibly, the Markov assumption seems very unlikely 
to hold for labor force data. As an example, persons who are 
unemployed in months 1 and 2 of a consecutive three- 
month period may not have the same probability of being 
unemployed in month 3 as persons who just became 
unemployed in month 2. The former group could contain 
more chronically unemployed persons than the group 
entering unemployment in month 2. Further, the group just 
entering unemployment in month 2 could contain a higher 
proportion of people temporarily out of work while 
changing jobs. Biemer and Bushery considered the 
consequences for the MLCA estimates of misclassification 
when the Markov assumption is violated. 

Using simulation, Biemer and Bushery found that the 
bias in MLCA estimates of classification probabilities 
depends upon the severity of the departures of the CPS data 
from the Markov assumption. They defined two parameters, 
$\lambda_1$ and $\lambda_2$, which are ratios of conditional probabilities. $\lambda_1$ 
is the probability of being employed in period 3 
for a person with an (EMP, UEM) pattern for periods 1 and 
2, respectively, divided by the probability of being 
employed in period 3 for a person with an (EMP, EMP) 
pattern. Similarly, $\lambda_2$ is the ratio of the probability of being 
employed in period 3 for a person with a (UEM, UEM) 
pattern to the probability of being employed in period 3 for 
a person with an (EMP, UEM) pattern. Note that when 
$\lambda_1 = \lambda_2 = 1$, the Markov assumption holds exactly, and 
greater departures of $\lambda_1$ and $\lambda_2$ from 1 correspond to 
greater departures of the data from the Markov assumption. 
Biemer and Bushery found that over a fairly wide range of 
values for $\lambda_1$ and $\lambda_2$, the absolute bias in the MLCA 
estimates of unemployment classification accuracy never 
exceeded 3 percentage points. For example, in the extreme 
case of a Markov assumption violation, the expected value 
of an MLCA estimate of unemployment accuracy would be 
77 percent when the true parameter value is 80 percent. 
Their results suggest that, for the CPS application, MLCA is 
fairly robust to failures of the Markov assumption. 

Although it is virtually impossible to prove their validity, 
MLCA error estimates can be quite useful for identifying 
survey questions that are prone to classification error; i.e., 
flawed questions. For example, Biemer (2004) and Biemer 
and Wiesen (2002) demonstrate the utility of MLCA 
methodology for identifying question problems and 
classification process deficiencies in large scale surveys. 
Notwithstanding that the MLCA assumptions may be 
violated to an unknown extent, its usefulness as a tool for 
exploring a number of important questionnaire design issues 
has been well-documented. For the present application, 
MLCA will be used to develop and test hypotheses 
regarding the sources of the anomaly reported by Biemer 
and Bushery for the 1994 CPS redesign. 

The MLCA model used in the present analysis is 
essentially the same model selected by Biemer and Bushery 
for their analysis. To account for population heterogeneity, 
they considered a number of demographic and other 
explanatory variables that might be highly correlated with 
classification error. The best performing variable was a proxy 
or self-response indicator variable denoted by P, where 

$$
P =
\begin{cases}
1 & \text{if all three interviews are conducted by self response (SELF)} \\
2 & \text{if two of the interviews are conducted by self response (MOSTLY SELF)} \\
3 & \text{if two of the interviews are conducted by proxy response (MOSTLY PROXY)} \\
4 & \text{if all three interviews are conducted by proxy response (PROXY).}
\end{cases}
$$
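
The definition of P amounts to counting self responses across the three interviews. The sketch below illustrates this; the function name and the `"self"`/`"proxy"` string coding are assumptions made for the example and do not correspond to CPS variable names.

```python
def proxy_indicator(modes):
    """Map three monthly response modes to the indicator P (1 to 4).

    modes is a sequence of three strings, each "self" or "proxy"
    (a hypothetical coding chosen for this illustration).
    """
    if len(modes) != 3 or any(m not in ("self", "proxy") for m in modes):
        raise ValueError("expected three modes, each 'self' or 'proxy'")
    n_self = sum(m == "self" for m in modes)
    # 3 self -> SELF, 2 -> MOSTLY SELF, 1 -> MOSTLY PROXY, 0 -> PROXY
    return {3: 1, 2: 2, 1: 3, 0: 4}[n_self]
```

For instance, a person interviewed by self response in months 1 and 3 and by proxy in month 2 falls in the MOSTLY SELF group (P = 2).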


Their empirical findings showed this variable to be strongly 
related not only to reporting accuracy, but also current 
employment status and month to month employment 
transitions. For example, responses for the PROXY group 
were considerably less accurate than for the SELF group 
and, further, the PROXY group had somewhat higher 
unemployment than the SELF group. 

The MLCA model also allows transition probabilities to 
vary by P (referred to as group heterogeneity) as well as by 
time periods (referred to as non-stationary transitions). In 
addition, the model assumes that the response probabilities 
$\pi_{a|px}$, $\pi_{b|py}$, and $\pi_{c|pz}$ are group-heterogeneous but are 
equal for all three months in the time interval. This leads to 
the following model for describing the cell probabilities in 
the PABC table: 

$$\pi_{pabc} = \sum_{x,y,z} \pi_{p}\, \pi_{x|p}\, \pi_{y|px}\, \pi_{z|py}\, \pi^{A|PX}_{a|px}\, \pi^{A|PX}_{b|py}\, \pi^{A|PX}_{c|pz} \qquad (6)$$

where $\pi^{A|PX}_{b|py} = \Pr(A = b \mid P = p, X = y)$ with similar 
definitions for $\pi^{A|PX}_{a|px}$ and $\pi^{A|PX}_{c|pz}$. That is, the three sets of 
response probabilities are equal to $\pi^{A|PX}$. 


Note that for the present analysis, interest is focused on 
the overall response probabilities associated with the revised 
and original questionnaires and not the variation in error 
rates across proxy groups. Therefore, our analysis focuses 
on the overall accuracy of response, i.e., $\pi^{A|X}$, or the mean 
response probability for the four levels of P combined. 


4. COMPARISON OF REVISED AND ORIGINAL 
QUESTIONNAIRE CLASSIFICATION ERROR 
PROBABILITIES 


4.1 Reduction in UEM Classification Accuracy for 
the Revised Questionnaire 


As mentioned in section 2, the CPS data sets for this 
analysis are monthly samples from August 1992 through 
May 1995. Figure 2 shows how this time interval was 
divided into 30 overlapping three-month intervals: 15 for 
the original questionnaire and 15 for the revised 
questionnaire. The intervals are numbered in the table for later 
reference. For example, time interval 1 covers the period 
from August 1992 through October 1992 in which the 
original questionnaire was in use. Therefore, this time 
interval can provide one estimate of the response 
probabilities, $\pi^{A|X}$, for the model in (6). Since there are 30 
time intervals across the entire 34-month period in our 
analysis, 30 estimates of $\pi^{A|X}$ can be formed from these 
consecutive overlapping time intervals: 15 estimates for the 
original questionnaire and 15 estimates for the revised 
questionnaire. 
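
The interval counts follow directly from the date ranges: each questionnaire covers 17 consecutive months, which yield 15 overlapping three-month windows. The short sketch below enumerates them; the function names are illustrative only.

```python
def month_range(start, end):
    """List (year, month) pairs from start to end, inclusive."""
    year, month = start
    months = []
    while (year, month) <= end:
        months.append((year, month))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def three_month_windows(months):
    """Overlapping windows of three consecutive months."""
    return [tuple(months[i:i + 3]) for i in range(len(months) - 2)]


# Original questionnaire: August 1992 through December 1993.
original = three_month_windows(month_range((1992, 8), (1993, 12)))
# Revised questionnaire: January 1994 through May 1995.
revised = three_month_windows(month_range((1994, 1), (1995, 5)))
```

Here `original[0]` is the August to October 1992 window (interval 1 in the text) and `revised[0]` is the January to March 1994 window (interval 16).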

To obtain a more stable estimate of $\pi^{A|X}$ for each 
questionnaire, the 15 estimates corresponding to the 15 time 
periods per questionnaire in Figure 2 were averaged. These 
estimates are shown in Tables 1 and 2. Since they are based 
on simple random sampling assumptions, the standard 
errors in the tables do not account for the unequal weighting 
and clustering effects of the CPS. Since the average CPS 
design effect is about 1.5 for estimates of unemployment, 
the standard errors in the tables are probably understated by 
20 percent or less. This level of bias in the standard errors is 
inconsequential for the purposes of this paper due to the 
extremely large sample sizes in the analysis. 
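
The "20 percent or less" figure can be checked with a short calculation, assuming the usual relation in which the design-based variance equals the design effect times the simple random sampling variance. The function names below are illustrative.

```python
import math


def srs_understatement(deff):
    """Fraction by which an SRS standard error understates the design-based one."""
    return 1.0 - 1.0 / math.sqrt(deff)


def adjusted_se(se_srs, deff):
    """Inflate an SRS standard error by the square root of the design effect."""
    return se_srs * math.sqrt(deff)


shortfall = srs_understatement(1.5)   # roughly 0.18 for a design effect of 1.5
```

With a design effect of 1.5, the simple random sampling standard error is about 18 percent smaller than the design-based one, which agrees with the statement in the text.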

Table 1 compares the MLCA estimates of the 
classification error probabilities for the original and revised 
questionnaire versions for the three-category labor force 
classification scheme used by Biemer and Bushery. The first 
column of the table is the true (or latent) category, the 
second column is the observed (or CPS) category, and the 
cell entries are the response probabilities (in percent) 
estimated from the MLCA using model (6). For each true 
class (EMP, UEM, or NLF), the accuracy rate is the cell 
corresponding to the observed category with the same label. 
For example, the accuracy of classifying persons who are 
truly employed is 98.68 percent for the original questionnaire 
and 98.84 percent for the revised questionnaire. Note that 
this entry corresponds to the cell where both the true 
category and the observed category are EMP. The other cells 
for EMP in column 1 are the error rates for EMP. For 
example, the MLCA estimate of the probability that the CPS 
classifies a person as UEM who is truly EMP is 0.42 percent 
for the original questionnaire and 0.39 percent for the revised 
questionnaire. The other cell entries are interpreted 
analogously. 

Consistent with Biemer and Bushery’s findings, the 
accuracy of the classification of unemployed persons is 
substantially and highly significantly lower for the revised 
questionnaire: 73.50 percent versus 79.06 percent for the 
original questionnaire, a difference of 5.6 percentage points. 
Further, the increase in classification error for unemployed 
persons is due to misclassifications into both the EMP and 
NLF categories, with slightly more misclassification into the 
latter category. Our estimates differ slightly from theirs since, 
as noted earlier, we are analyzing more months of data and 
using weighted estimates rather than unweighted as in their 
analysis. 
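
To see why a difference of this size is flagged as significant at the 0.001 level, one can compare the ratio of the difference to its standard error with a normal critical value. The sketch below assumes a two-sided normal test, which the paper does not spell out, and takes the difference and standard error for the UEM accuracy cell from Table 1.

```python
from statistics import NormalDist


def z_statistic(diff, se):
    """Ratio of an estimated difference to its standard error."""
    return diff / se


# Two-sided critical value at alpha = 0.001 (an assumed test form).
critical = NormalDist().inv_cdf(1 - 0.001 / 2)

# UEM accuracy: original minus revised, with its standard error (Table 1).
z_uem = z_statistic(5.56, 0.54)
```

The ratio is a little over 10, far beyond the critical value of about 3.29. It remains well above that threshold even if the standard error is inflated by the square root of a design effect of 1.5.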


[Figure 2 grid: each row is a pair of interval numbers, from 1 (Old) and 16 (New) through 15 (Old) and 30 (New); the columns are the calendar months under the original questionnaire (1992-1993) aligned with the corresponding months under the revised questionnaire (1994-1995); an X marks the three consecutive months covered by each interval.] 

Note: An ellipsis symbol is used in this table to indicate that the pattern established for the preceding months continues for the remaining months. 

Figure 2. The 30 Three-Month Time Intervals Analyzed for the Revised and Original Questionnaires 

Table 1 
Comparison of CPS Labor Force Response Probabilities for the Original and Revised Questionnaires 

True Class  Observed Class  Original     Revised      Original - Revised
                            (1992-1993)  (1994-1995)  Diff     S.E.
EMP         EMP             98.68        98.84        -0.15    0.40
            UEM              0.42         0.39         0.03    0.40
            NLF              0.90         0.78         0.13    0.16
UEM         EMP              8.23        10.57        -2.34    0.45
            UEM             79.06        73.50         5.56    0.54
            NLF             12.71        15.93        -3.22*   0.26
NLF         EMP              2.14         1.99         0.15    0.36
            UEM              1.43         1.56        -0.13    0.33
            NLF             96.43        96.45        -0.02    0.18

* Significant at α = 0.001. 


Table 2 
Comparison of Two Unemployed Subcategories for the Original and Revised Questionnaires 

True Class    Observed Class  Original     Revised      Original - Revised
                              (1992-1993)  (1994-1995)  Diff      S.E.
UEM-LAYOFF    EMP             16.32        26.67        -10.35*   0.91
              UEM-LAYOFF      61.30        55.63          5.66    1.03
              UEM-LOOKING     17.61         8.41          9.20*   0.45
              NLF              4.77         9.29         -4.52    0.28
UEM-LOOKING   EMP              7.03         7.51         -0.48    0.29
              UEM-LAYOFF       1.03         0.65          0.38    0.26
              UEM-LOOKING     78.00        74.61          3.39*   0.21
              NLF             13.94        17.23         -3.29    0.18

* Significant at α = 0.001. 


Table 2 shows the same set of estimates for the truly 
unemployed population only, in somewhat greater detail. In 
this table, we considered the two primary subclassifications 
of unemployed: UEM-LAYOFF and UEM-LOOKING. 
This table provides information regarding the source of the 
difference in accuracy rates between the two questionnaire 
versions. We first consider the misclassification of true 
LAYOFF persons (top half of the table) and then consider 
the LOOKING persons (bottom half of the table). 

For persons on layoff, classification accuracy appears to 
have dropped an average of 5.66 percentage points with the 
introduction of the revised questionnaire: from 61.30 
percent to 55.63 percent. However, the patterns of 
classification error also changed. For the original 
questionnaire, the probability that a person on layoff is 
misclassified as looking for work is estimated at about 18 
percent. The corresponding estimate for the revised 
questionnaire is less than half that: about 8.4 percent. In 
addition, the data suggest that misclassification of 
unemployed persons on layoff as either employed or not in 
the labor force increased by 10.35 and 
4.52 percentage points, respectively. 
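
The shift in the error pattern for persons truly on layoff can be tabulated directly from the top half of Table 2. The sketch below simply subtracts the original-questionnaire row from the revised-questionnaire row; the variable names are illustrative, and the values are the published, rounded percentages.

```python
# Response probabilities (percent) for persons truly on layoff, from Table 2.
layoff_original = {"EMP": 16.32, "UEM-LAYOFF": 61.30,
                   "UEM-LOOKING": 17.61, "NLF": 4.77}
layoff_revised = {"EMP": 26.67, "UEM-LAYOFF": 55.63,
                  "UEM-LOOKING": 8.41, "NLF": 9.29}

# Change in each cell (revised minus original), in percentage points.
change = {k: round(layoff_revised[k] - layoff_original[k], 2)
          for k in layoff_original}
```

Because each row sums to 100 percent, the changes sum to zero: the probability mass lost from the UEM-LAYOFF and UEM-LOOKING cells reappears in the EMP and NLF cells. The rounded inputs give a drop of 5.67 points in the UEM-LAYOFF cell, whereas the table reports 5.66, presumably because it was computed from unrounded estimates.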

Now consider persons who are truly looking for work in 
the bottom half of Table 2. According to the MLCA model, 
classification accuracy for the redesigned CPS decreased 
significantly from 78.00 to 74.61 percent. Most of the 
misclassification is attributed to misclassifying persons 
looking for work as NLF. This result would arise, for 
example, if the questions regarding active and passive job 
search activities are prone to error. To further investigate 
this finding, we conducted an analysis of each of the 
questions used to determine the LOOKING recode. In the 
next section, we first consider the sources of error in the 
LAYOFF classification and then investigate the sources of 
error for the LOOKING classification. 


4.2 Specific Questions Responsible for the 
Reduction in LAYOFF Accuracy 


4.2.1 Decomposition of the LAYOFF Recode 


Individuals in the CPS are classified as LAYOFF on the 
basis of their responses to five questions in the original 
questionnaire and eight questions in the revised question- 
naire. These questions are listed in Figure 3. Initially, we 
consider which questions or combinations of questions 
contribute most to the error rate observed in Table 2 for the 
LAYOFF recoded variable and then show how MLCA 
models can be applied to estimate the contributions to 
classification error of individual questions that are used to 
classify an individual as LAYOFF. The methodology 
employed for this is similar to the MLCA approach used 
previously for estimating the aggregate classification error. 
We will describe this technique in terms of the LAYOFF 
classification, but it will be applied subsequently to 
decompose the error in both the LAYOFF and LOOKING 
classification processes. 

First, we combine the questions in Figure 3 using 
logical operators such as “and,” “or,” “if-then-else,” etc. to 
form a set of dichotomous “compound” questions with the 
property that each compound question must be answered 
positively in order for an individual to be classified as 
LAYOFF by the CPS classification process. Let 
$Q_k$, $k = 1, \ldots, K$, denote the outcomes to the K compound 
questions that were formed for the LAYOFF classification, 
where $Q_k = 1$ denotes a positive outcome and $Q_k = 2$ 
denotes a negative outcome. Then an individual in the CPS 
is classified as LAYOFF if and only if $Q_k = 1$ for 
$k = 1, \ldots, K$. In Figure 4, we define a set of four compound 
questions for the original questionnaire, labeled O1-O4, and 
five compound questions for the revised questionnaire, 
labeled N1-N5. 

For each classification, $Q_k$, there is a corresponding true, 
unobservable (latent) classification, $T_k$, defined in analogy 
to $Q_k$; i.e., an individual is truly on layoff by the CPS 
definition if and only if $T_k = 1$, $k = 1, \ldots, K$. Next, we will 
use MLCA to estimate the misclassification error rates for 
each compound question $Q_k$ by treating these as indicators 
for the unknown true latent characteristics, $T_k$. 
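
The classification rule for the compound questions can be sketched in a few lines. The outcome coding (1 = positive, 2 = negative) follows the text; the function names, including the hypothetical helper `first_negative`, are illustrative and are not part of the CPS recoding algorithm.

```python
def classified_layoff(q):
    """True if every compound-question outcome is positive (coded 1)."""
    return all(answer == 1 for answer in q)


def first_negative(q):
    """1-based index of the first negative outcome (coded 2), or None."""
    for k, answer in enumerate(q, start=1):
        if answer == 2:
            return k
    return None
```

For the four compound questions of the original questionnaire, the pattern (1, 1, 1, 1) is classified as LAYOFF, while (1, 1, 2, 2) is not, and its first negative outcome occurs at the third compound question.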

The probability of an error in the classification of 
LAYOFF can be written as 

$$\Pr(Q_k = 2 \text{ for some } k \mid T_k = 1,\; k = 1, \ldots, K), \qquad (7)$$

which is the probability that an individual who is truly on 
layoff answers at least one of the K compound questions 
negatively. 

Next, we define the latent variable, W, as the number of 
compound questions for which the true response is positive, 
i.e., 

$$W = \begin{cases} 0 & \text{if } T_1 = 2, T_2 = 2, \ldots, T_K = 2 \\ 1 & \text{if } T_1 = 1, T_2 = 2, \ldots, T_K = 2 \\ \;\vdots & \\ K & \text{if } T_1 = 1, T_2 = 1, \ldots, T_K = 1. \end{cases} \qquad (8)$$

For example, W = 0 if a person’s true response pattern to 
the questions O1 to O4 is (2,2,2,2), W = 1 if the true response 
pattern is (1,2,2,2), and so on. Note that W = K 
corresponds to a true layoff. Thus, for the original 
questionnaire, W = 0, ..., 4 and for the revised 
questionnaire, W = 0, ..., 5. 

To decompose the probability in (7) into individual 
components for the compound questions, $Q_k$, we rewrite (7) 
in terms of the error probabilities associated with each 
compound question. Thus, it can be shown that (7) can be 
rewritten as 

$$\sum_{k=1}^{K} \Pr(Q_1 = 1, \ldots, Q_{k-1} = 1, Q_k = 2 \mid W = K). \qquad (9)$$

The kth term in the sum may be interpreted as the 
contribution of question $Q_k$ to the probability of being 
misclassified given a true LAYOFF. 

To estimate the components of (9) using MLCA, we 
define a classification variable, R, in 
analogy to W for the observed values of $Q_k$; i.e., 

$$R = \begin{cases} 0 & \text{if } Q_1 = 2, Q_2 = 2, \ldots, Q_K = 2 \\ 1 & \text{if } Q_1 = 1, Q_2 = 2, \ldots, Q_K = 2 \\ \;\vdots & \\ K & \text{if } Q_1 = 1, Q_2 = 1, \ldots, Q_K = 1. \end{cases} \qquad (10)$$


Question   Wording 

Original Questionnaire 
Q19        What were you doing most of LAST WEEK? 
Q20        Did you do any work at all LAST WEEK not counting work around the house? 
Q21        Did you have a job or business from which you were temporarily absent or on layoff LAST WEEK? 
Q21A       Why were you absent from work LAST WEEK? 
Q22E       Could you have taken a job LAST WEEK if one had been offered? 

Revised Questionnaire 
Q20        LAST WEEK, did you do ANY work (either) for pay (or profit)? 
Q20B-a     LAST WEEK, (in addition to the business,) did you have a job, either full or part time? Include any job from which you 
           were temporarily absent. 
Q20B-b     LAST WEEK, were you on layoff from a job? 
Q20B-1     What was the main reason you were absent from work LAST WEEK? 
Q21        Has your employer given you a date to return to work? 
Q21A       Have you been given any indication that you will be recalled to work within the next 6 months? 
Q21A-1     Could you have returned to work LAST WEEK if you had been recalled? 
Q21A-2     Why is that? 


Figure 3. Primary Components of UEM for the Original and Revised Questionnaires 
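The variables W and R both summarize a response pattern by a single number. The sketch below assumes, in line with the example patterns (2,2,2,2) and (1,2,2,2), that the index is the number of positive outcomes preceding the first negative one; `pattern_index` is a hypothetical helper name, not something defined in the paper.

```python
def pattern_index(pattern):
    """Count the positive outcomes (coded 1) before the first negative (coded 2)."""
    count = 0
    for outcome in pattern:
        if outcome != 1:
            break
        count += 1
    return count


print(pattern_index((2, 2, 2, 2)))  # 0
print(pattern_index((1, 2, 2, 2)))  # 1
print(pattern_index((1, 1, 1, 1)))  # 4, i.e., K for the original questionnaire
```

Applied to the true outcomes the function gives W; applied to the observed outcomes it gives R.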


136 Biemer: An Analysis of Classification Error for the Revised Current Population Survey Employment Questions 


For each compound question number, the entry gives the source question(s) from the CPS 
questionnaire and the source question response for which the compound question response is positive. 


Original Questionnaire 

O1  Source:      Q19: What were you doing most of LAST WEEK? 
                 or 
                 Q20: Did you do any work at all LAST WEEK not counting work around the house? 
    Positive if: Q19: Any response except working 

O2  Source:      Q21: Did you have a job or business from which you were temporarily 
                 absent or on layoff LAST WEEK? 

O3  Source:      Q21A: Why were you absent from work LAST WEEK? 
    Positive if: Temporary layoff (Under 30 days) or 
                 Indefinite layoff (30 days or more or no definite recall date) 

O4  Source:      Q22E: Could you have taken a job LAST WEEK if one had been offered? 


Revised Questionnaire 

N1  Source:      Q20: LAST WEEK, did you do ANY work (either) for pay (or profit)? 
    Positive if: No 

N2  Source:      Q20B-a: LAST WEEK, (in addition to the business,) did you have a job, 
                 either full or part time? Include any job from which you were temporarily 
                 absent. 
    Positive if: Any response except “retired,” “disabled,” or “unable to work” 

N3  Source:      Q20B-b: LAST WEEK, were you on layoff from a job? 
                 or 
                 Q20B-1: What was the main reason you were absent from work LAST WEEK? 
    Positive if: Q20B-b: Yes 
                 or 
                 Q20B-1: “On layoff” or “slack work/business conditions” 

N4  Source:      Q21: Has your employer given you a date to return to work? 
                 or 
                 Q21A: Have you been given any indication that you will be recalled to work 
                 within the next 6 months? 
    Positive if: Q21: Yes 
                 or 
                 Q21: No and Q21A: Yes 

N5  Source:      Q21A-1: Could you have returned to work LAST WEEK if you had been recalled? 
                 or 
                 Q21A-2: Why is that? 
    Positive if: Q21A-1: Yes 
                 or 
                 Q21A-1: No and Q21A-2: Own temporary illness 


Figure 4. Compound Questions Used in the LAYOFF Recode for Original and Revised Questionnaire Versions 


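Let $\pi_{k|K}$ denote $\Pr(R = k \mid W = K)$. Then for $k > 0$ we 
may write 

$$\Pr(Q_1 = 1, \ldots, Q_{k-1} = 1, Q_k = 2 \mid W = K) = \Pr(R = k - 1 \mid W = K) = \pi_{k-1|K}. \qquad (11)$$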


Thus, the contributions to error of each LAYOFF question 
can be obtained from the probabilities in (11). 

To estimate the probabilities $\pi_{k|K}$, we fit MLCA models 
to the same data from the 1993 and 1994 CPS as used in the 
previous analysis and replicated the analysis on the 1993 
parallel survey data. Data from the 1992 and 1995 CPS 
were not part of this analysis. The MLCA models used were 
similar to those described in the analysis for Tables 1 and 2. 
That is, we used three consecutive months of data and 
estimated the components in (10) for 10 consecutive, 
overlapping intervals for each year (i.e., January—March, 
February—April, and so on to October—December). For the 
original questionnaire, the model specified three latent 
variables corresponding to the three months within a time 
period, each with K +1=5 latent classes. For the revised 
questionnaire, we use an identical model except each latent 
variable had K +1=6 latent classes. 
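The ten overlapping three-month intervals per year can be enumerated mechanically. The following is a small sketch (the function name `overlapping_windows` is hypothetical) that simply slides a window of width three over the twelve months:

```python
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def overlapping_windows(months, width=3):
    """Return every run of `width` consecutive months, shifted one month at a time."""
    return [tuple(months[i:i + width]) for i in range(len(months) - width + 1)]


windows = overlapping_windows(MONTHS)
print(len(windows))                        # 10
print(windows[0][0], "to", windows[0][-1])    # January to March
print(windows[-1][0], "to", windows[-1][-1])  # October to December
```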

As before, the best MLCA model for this analysis 
incorporated the proxy-self grouping variable, P, and 
specified non-stationary transitions, equal response 
probabilities within time period, group heterogeneous 
transition probabilities, and heterogeneous response 
probabilities. The model provides an adequate fit to the data for 
all months in the analysis (i.e., p > 0.05). 

Table 3 provides a summary of the results from this 
analysis. In the column labeled “Percent of Total” we report 
$p_k \times 100$ percent, where 

$$p_k = \frac{\hat{\pi}_{k-1|K}}{\sum_{j=1}^{K} \hat{\pi}_{j-1|K}} \qquad (12)$$

is the proportion of the classification error due to compound 
question k in Figure 4 and where $\hat{\pi}_{k|K}$ are the MLCA 
estimates of $\pi_{k|K}$. 

The contribution to total error presented in Table 3 
(Error Rate column) is estimated by 
$p_k \times \Pr(A \neq 2 \mid X = 2)$, where $p_k$ is given by (12) and 
$\Pr(A \neq 2 \mid X = 2)$ is estimated from Table 2 as 1 minus the 
accuracy rate for LAYOFF. For the original questionnaire, 
the components that contribute most to LAYOFF 
classification error are question O2 (64.2 percent) and 
question O1 (27.2 percent). These two questions taken 
together explain more than 90 percent of the error in the 
LAYOFF classification. 
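The arithmetic linking the two kinds of column in Table 3 can be sketched as follows. The sketch assumes that the kth term of the decomposition equals Pr(R = k - 1 | W = K) and that the share for question k is that term divided by the total error probability; the function name `error_shares` is hypothetical and the input probabilities are made-up values for illustration, not estimates from the CPS.

```python
def error_shares(pi_given_true_layoff):
    """Split the total misclassification probability among compound questions.

    pi_given_true_layoff[r] is Pr(R = r | W = K) for r = 0, ..., K.
    Returns (total_error, shares), where shares[k - 1] is the share
    attributed to compound question k.
    """
    K = len(pi_given_true_layoff) - 1
    terms = pi_given_true_layoff[:K]   # first negative answer at question 1, ..., K
    total_error = sum(terms)           # equals 1 - Pr(R = K | W = K)
    shares = [term / total_error for term in terms]
    return total_error, shares


# Illustrative values only (K = 4); they sum to 1.
pi = [0.10, 0.25, 0.02, 0.01, 0.62]
total_error, shares = error_shares(pi)
for k, share in enumerate(shares, start=1):
    print(f"question {k}: share {100 * share:.1f} percent, "
          f"contribution {100 * share * total_error:.1f} points")
```

Multiplying each share by the total error rate recovers the percentage-point contribution, so the two columns carry the same information on different scales.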
For the revised questionnaire, estimates from the 1994 
CPS indicate that more than 90 percent of the error in the 
LAYOFF classification arises from two components: N1 
and N4. 

The analysis for the revised questionnaire was repeated 
on the Parallel Survey with very similar results. The same 
two components emerge as contributing more than 90 
percent of the error. As mentioned in section 2, the utility of 
the 1993 Parallel Survey as an indicator of data quality for 
the revised questionnaire is in doubt. Nevertheless, the 
agreement of the results from the Parallel Survey and the 
1994 CPS adds strength to the findings from the 1994 CPS 
analysis. 

Thus, the reduction in LAYOFF classification accuracy for 
the revised questionnaire appears to be due primarily to 
error in the responses to two compound questions: N1, the 
revised global question “LAST WEEK, did you do ANY 
work (either) for pay (or profit)?” and N4, which determines 
whether an individual reporting some type of layoff has a 
date or indication of a date to return to work. The MLCA 
estimates indicate that almost 60 percent of the error in the 
revised LAYOFF classification may be attributed to N1 
while about 34 percent may be attributed to N4. 


4.2.2 Decomposition of the LOOKING Recode 


The estimation process described for LAYOFF was also 
applied to the LOOKING recode. Note that compound 
questions O1, O2, N1, and N2 defined in Figure 5 for 
LOOKING are the same questions as defined in Figure 4 for 
LAYOFF. Since O1, O2, and N1 appeared to be 
problematic for LAYOFF, we might expect them to be 
problematic for LOOKING as well. 

Following the approach used for LAYOFF, for each 
survey year, we defined a latent variable, W, as in (8) and an 
indicator variable, R, as in (10). As we did in the LAYOFF 
analysis, we fit MLCA models to the data and determined 
that the best MLCA model for the analysis is the model 
incorporating the proxy-self grouping variable, P, and 
specifying non-stationary transitions, equal response 
probabilities within time period, group heterogeneous transition 
probabilities, and heterogeneous response probabilities. This 
model provides an adequate fit to the data for all months in 
the analysis (i.e., p > 0.05). As before, we include the results 
from the Parallel Survey for comparison with the 1994 CPS 
results; however, the latter results will be emphasized. 

Table 4 displays the values of $p_k$ defined in (12) for the 
LOOKING classification. For the original questionnaire, the 
major contributors to classification error appear to be 
questions O1 and O3, which contribute 31.5 and 56.3 
percent of total classification error, respectively. Question 
O2, which was quite problematic for the LAYOFF 
population, appears less so for the LOOKING population. While 
it contributes 64.2 percent of the LAYOFF error estimate 
(or 24.8 percentage points to the error rate), O2 contributes 
only 11.3 percent of the LOOKING error estimate (or 
2.5 percentage points to the error rate). 

For the revised questionnaire, the results from the 
analysis of the Parallel Survey and the 1994 CPS are again 
quite similar. The component N1 appears to be an important 
source of error for LOOKING as it was for the LAYOFF 
analysis. However, its contribution to LOOKING is smaller: 
10 percentage points compared with 25 percentage points 
for LAYOFF. The biggest contributor to LOOKING error 
seems to be question N3 which contributes 64.5 percent of 
the error based on the CCO analysis and 51.1 percent based 
on the 1994 CPS analysis. 

Thus, the initial labor force question appears to be 
problematic for both questionnaire versions. The MLCA 
suggests that persons who are looking for work as well as 
persons who are on layoff experience some difficulty 
responding to the question “LAST WEEK, did you do ANY 
work (either) for pay (or profit)?”. The changes made to this 
question in 1994 do not appear to have improved the 
accuracy of this question for either population. 


Table 3 
Percent Contributions to Error in LAYOFF Classifications for Compound Questions for the 1993 CPS, Parallel Survey, 
and the 1994 CPS 


Question             1993 CPS                      Parallel Survey               1994 CPS 
                     (Original Version)            (Revised Version)             (Revised Version) 
                     Error Rate  Percent of Total  Error Rate  Percent of Total  Error Rate  Percent of Total 
Old Questionnaire 
O1                   10.53       27.20             -           -                 -           - 
O2                   24.84       64.19             -           -                 -           - 
O3                   2.35        6.08              -           -                 -           - 
O4                   0.67        1.74              -           -                 -           - 
New Questionnaire 
N1                   -           -                 23.19       52.26             25.34       57.1 
N2                   -           -                 0.00        0.00              0.00        0.00 
N3                   -           -                 2.76        6.22              3.06        6.90 
N4                   -           -                 18.42       41.52             15.08       33.98 
N5                   -           -                 0.00        0.00              0.89        2.00 
Total                38.39       100.00            44.37       100.00            44.37       100.00 

Table 4 
Percent Contributions to Error in LOOKING Classifications by Compound Questions for the 1993 CPS, Parallel Survey, 
and the 1994 CPS 


Question             1993 CPS                      Parallel Survey               1994 CPS 
                     (Original Version)            (Revised Version)             (Revised Version) 
                     Error Rate  Percent of Total  Error Rate  Percent of Total  Error Rate  Percent of Total 
Old Questionnaire 
O1                   6.93        31.50             -           -                 -           - 
O2                   2.49        11.34             -           -                 -           - 
O3                   12.39       56.33             -           -                 -           - 
O4                   0.18        0.83              -           -                 -           - 
New Questionnaire 
N1                   -           -                 8.38        33.00             10.00       39.40 
N2                   -           -                 0.00        0.00              0.00        0.00 
N3                   -           -                 16.38       64.5              12.97       51.08 
N4                   -           -                 0.46        1.81              2.21        8.96 
N5                   -           -                 0.18        0.71              0.14        0.56 
Total                22.00       100.00            25.39       100.00            25.39       100.00 


For each compound question number, the entry gives the source question(s) from the CPS 
questionnaire and the source question response for which the compound question response is positive. 


Old Questionnaire 

O1  Source:      Q19: What were you doing most of LAST WEEK? 
                 or 
                 Q20: Did you do any work at all LAST WEEK not counting work around the house? 
    Positive if: Q19: Any response except working 
                 and 
                 Q20: No 

O2  Source:      Q21: Did you have a job or business from which you were temporarily 
                 absent or on layoff LAST WEEK? 

O3  Source:      Q22: Has ... been looking for work during the past 4 weeks? 
                 and 
                 Q22A: What has ... been doing in the last 4 weeks to find work? 
    Positive if: Q22: Yes or response to Q19 was LK (LOOKING) 
                 and 
                 Q22A: Response other than “nothing” 

O4  Positive if: Yes or 
                 No, and reason is “Already has job” or “Own temporary illness” 


New Questionnaire 

N1  Source:      Q20: LAST WEEK, did you do ANY work (either) for pay (or profit)? 
    Positive if: No 

N2  Source:      Q20B-a: LAST WEEK, (in addition to the business,) did you have a job, 
                 either full or part time? Include any job from which you were temporarily 
                 absent. 

N3  Source:      Q22: Have you been doing anything to find work during the last 4 weeks? 

N4  Source:      Q22A: What are all the things you have done to find work during the last 4 
                 weeks? 
                 or 
                 Q22A-DK: You said you have been trying to find work. How did you go 
                 about looking? 
                 and 
                 Q22A-DK1: Can you tell me more about what you did to search for work? 
    Positive if: Mention of at least 1 active activity. 

N5  Source:      LAST WEEK, could you have started a job if one had been offered? 
    Positive if: Yes 


Note: In a few cases, N2 was positive if response to Q20B-a was “Disabled” or “Unable” and response to Q20A-1: “Does your disability 
prevent you from accepting any kind of work during the next six months?” was “No”. 


Figure 5. Compound Questions Used in the LOOKING Recode for Original and Revised Questionnaire Versions 


The key difficulty for the LOOKING category appears to 
be determining whether persons who are truly looking for 
work have made efforts of any type (either passive or 
active) in the past four weeks to find work. If a respondent is 
classified correctly as having made some effort, the next 
step in the process (viz., determining whether the efforts 
satisfy the definition of active looking) is not problematic 
according to the estimates in Table 4. 


5. CONCLUSIONS 


Biemer and Bushery (2000) provide some evidence that 
unemployment classification accuracy rates in the 1994 CPS 
redesign survey were smaller than for the original survey 
design used prior to 1994. This paper provides additional 
evidence of their findings based upon a more extensive 
analysis of CPS data from 1992 through 1994. Our results 
indicate that the probability of correctly classifying 
unemployed persons decreased from 79.1 percent to 73.5 
percent — a difference of 5.6 percentage points. We estimate 
that roughly 60 percent of the reduction (3.4 percentage 
points) is due to an increase in the classification error for 
persons on layoff while the remainder (2.2 percentage 
points) is due to an increase in the classification error for 
persons looking for work. 
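The split of the reduction quoted above can be checked with a few lines of arithmetic. This is only a consistency check on the rounded figures reported in the text; the variable names are hypothetical.

```python
accuracy_original = 79.1   # percent correctly classified, original questionnaire
accuracy_revised = 73.5    # percent correctly classified, revised questionnaire
layoff_part = 3.4          # percentage points attributed to LAYOFF
looking_part = 2.2         # percentage points attributed to LOOKING

reduction = accuracy_original - accuracy_revised
layoff_share = layoff_part / reduction

print(f"reduction: {reduction:.1f} percentage points")   # 5.6
print(f"LAYOFF share: {100 * layoff_share:.1f} percent")  # 60.7, i.e., roughly 60
```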

For the revised questionnaire, the LAYOFF and 
LOOKING classifications are each based upon five 
compound questions. For LAYOFF, two compound 
questions emerged as being problematic. One is the initial 
labor force question, which asks “LAST WEEK, did you do 
ANY work (either) for pay (or profit)?” The contribution of 
this component to LAYOFF misclassification is estimated 
to be approximately 57 percent, which is more than double 
the corresponding rate for this question in the original 
questionnaire. In addition, a large error rate is estimated for 
the compound question formed by two questions: “Has your 
employer given you a date to return to work?” and “Have 
you been given any indication that you will be recalled to 
work within the next 6 months?” Approximately 34 percent 
of the estimated LAYOFF error rate is due to this 
combination. Since there are no corresponding questions in the 
original questionnaire, most of the error in classifying 
persons on layoff in the revised questionnaire may be linked 
to these two questions. 

For classifying persons who are looking for work in the 
redesigned survey, two questionnaire components appear to 
contribute most to classification error: “LAST WEEK, did 
you do ANY work (either) for pay (or profit)?” and “Have 
you been doing anything to find work during the last 4 
weeks?/What has...been doing in the last 4 weeks to find 
work?” The error rates for both questions are slightly larger 
for the revised questionnaire than for the original 
questionnaire. These increases, therefore, explain the slight increase 
in LOOKING classification error observed for the revised 
questionnaire. 

The error in CPS unemployment classification is well 
documented; for example, see Chua and Fuller 1987; 
Abowd and Zellner 1985; Poterba and Summers 1995; and 
Sinclair and Gastwirth 1998. A widely accepted measure of 
reliability for the CPS (viz., the index of inconsistency 
computed from the CPS reinterview) shows that the reliability of the 
CPS unemployment classification decreased after the 
redesign. Results provided in this paper are consistent with 
these prior studies and help determine the source of the error 
in the CPS classification of the unemployed. At a minimum, 
our results provide a basis for further investigation into the 
root causes of the errors in the collection of labor force data 
in the CPS. Through cognitive laboratory experiments and 
field experiments, we may identify causes of the error in the 
unemployment questions that would suggest ways to 
improve the questions. Such improvements could be 
implemented in a future redesign of the CPS. 
Comment 


JEROEN K. VERMUNT ' 


1. INTRODUCTION 


I very much enjoyed reading this well-written paper. 
The topic addressed by Paul Biemer, classification errors 
in the measurement of employment status, is a very 
important one. Employment statistics are among the most 
important macro-economic indicators and, ideally, they 
would be free of error. It turns out, however, to be 
impossible to measure a person’s employment status 
without error. The best that can be done is to design the data 
collection in such a manner that the classification errors at 
the individual level are minimized. The 
current paper contributes to this objective. 

An earlier study by Biemer and Bushery (2000) indicated 
that the 1993 changes in the measurement procedure that 
were intended to reduce classification errors actually increased 
measurement error. In the current paper, Paul Biemer 
replicates these former analyses with a longer time series 
and with an extra employment category obtained by 
splitting the unemployed group into “on layoff” and 
“looking for work”. The reported results confirm the earlier 
conclusions that the new procedure is worse than the old 
procedure. In a second step, Biemer tries to disentangle the 
sources of measurement error for the two unemployed 
categories by modeling the separate questions that are used 
to determine whether a person is “on layoff” and “looking 
for work”, respectively. Sources of error are identified that 
point to possible improvements in the questionnaire. 

Because of my background, my commentary will mainly 
concern methodological and statistical issues. More 
precisely, I will discuss some methodological problems 
related to the application of the LC Markov model, as well as 
indicate how the statistical analysis could be somewhat 
refined. It is, however, not clear whether such a more 
elegant modeling will yield very different conclusions. I 
want to stress once more that this is a great paper. My 
critical remarks are only meant to stimulate the discussion. 


2. LATENT CLASS MARKOV: METHODOLOGY 


The main engine of the study performed by Paul Biemer 
is the LC or hidden Markov model. Several assumptions 
that may affect the results have to be made 
when, as in this study, the model is applied with a single 
indicator per occasion. The assumption that is discussed in 
detail by Biemer is the first-order Markov process 
assumption. Simulation studies by Biemer and Bushery 
showed that, fortunately, estimates of classification error are 
not very sensitive to this assumption. Another assumption 
that is needed here for model identification is that the 
measurement error is constant over time. This assumption 
does not seem to be very problematic in the current study 
since we are looking for a single time-constant measure for 
classification error. Moreover, there is no good reason to 
assume that the quality of the measurement procedure 
changed over time while the procedure itself did not change 
(of course, apart from the questionnaire redesign). I am 
much more concerned about the third assumption; that is, 
the assumption of independent classification errors (ICE) 
over time (Bassi, Hagenaars, Croon and Vermunt 2000). Is 
it realistic to assume that the occurrence of a certain type of 
classification error at time point t does not affect the 
probability of making the same mistake at time point t + 1? 
In my opinion, this assumption is not realistic in the current 
application. For example, a respondent who makes a 
mistake because (s)he did not understand one of the 
questions will most probably (or at least be more likely than 
others to) make the same error again at the next occasion. In 
my opinion, it is necessary to conduct a simulation study to 
determine the sensitivity of the estimated classification 
errors to violations of the ICE assumption. 

I have another critical remark concerning the use of the 
LC Markov model for quantifying measurement error in a 
person’s employment state. According to the model, there is 
a probabilistic relationship between an individual’s true and 
observed states. What is, however, the true state? Is it the 
true employment state occupied at a particular time point, or 
the state that would have been recorded with an error-free or 
gold-standard instrument? Or is it the state a person would 
have occupied under “normal conditions”? That is, if also 
randomness in his/her behavior is filtered out. 

I will illustrate my point with a small example. Suppose 
that there are two types (two latent segments) of coffee 
consumers: consumers who prefer brand A and consumers 
who prefer brand B, and that I belong to the brand B 
segment, which means that under normal circumstances I 
buy brand B coffee. In an interview, I am asked which 
brand I bought last week. Suppose I report that I bought a 
brand A package of coffee, and that I am neither lying nor 
making a mistake. In other words, there is no classification 
error in the sense of making a mistake: I really bought brand 
A this week (the researcher doesn’t know that, of course). 
On the other hand, my behavior this week is 
inconsistent with my preference, which means that in terms 
of measurement of my preference there is a classification 
error. This example illustrates that there are two types of 
“errors” that can be made: an error in the reporting and an 
“error” in the behavior. The “error” in my behavior this 
week may have many causes, such as “brand B was sold 
out”, “brand A was offered at a lower price this week”, “I 
could not find the brand B package because of changes in 
the arrangement of the supermarket”, etc. The LC Markov 
model is not able to distinguish such randomness in the 
behavior that is uncorrelated across time points from real 
classification errors. 

What does this imply for the employment application? It 
implies that an individual’s true state may be “on layoff”, 
but for some reason (by chance) this particular month (s)he 
has worked. If this “some reason” is uncorrelated with other 
“some reasons” for being in the “wrong” observed state at 
other occasions, it will be labeled classification error by the 
LC Markov model. While in the case of the measurement of 
preferences based on revealed (or stated) preferences 
correcting for randomness in behavior seems to be exactly 
what we wish to accomplish, this is clearly not the case in 
the measurement of employment status. I, therefore, have 
the strong feeling that the error rates reported by Biemer 
might be somewhat overestimated because of randomness 
in employment behavior, for instance, caused by 
randomness in the functioning of the labor market. 

A well-known consequence of modeling individual 
change by means of an LC Markov model is that the 
estimated number of latent transitions is much smaller than 
the corresponding observed numbers. The reason for this is 
that both independent classification errors and independent 
random behavior are filtered out; that is, part of the observed 
change is attributed to these phenomena. 
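The mechanism can be illustrated with a small simulation. This is a sketch under assumed values (two states, a 95 percent chance of staying in the same true state, a 5 percent independent misclassification rate); the function name and all parameter values are hypothetical and are not taken from the CPS analysis.

```python
import random


def change_rates(n_people=20000, n_months=3, stay=0.95, error=0.05, seed=1):
    """Simulate a two-state Markov chain observed with independent errors.

    Returns (true_change_rate, observed_change_rate) over consecutive months.
    """
    rng = random.Random(seed)
    true_changes = observed_changes = 0
    for _ in range(n_people):
        true_state = rng.randint(0, 1)
        prev_true = prev_obs = None
        for month in range(n_months):
            if month > 0 and rng.random() > stay:
                true_state = 1 - true_state          # true transition
            # Independent classification error at each occasion.
            observed = true_state if rng.random() > error else 1 - true_state
            if month > 0:
                true_changes += int(true_state != prev_true)
                observed_changes += int(observed != prev_obs)
            prev_true, prev_obs = true_state, observed
    pairs = n_people * (n_months - 1)
    return true_changes / pairs, observed_changes / pairs


true_rate, observed_rate = change_rates()
print(f"true change rate:     {true_rate:.3f}")
print(f"observed change rate: {observed_rate:.3f}")
```

Under these assumed values the observed change rate is well above the true one, because every isolated misclassification produces spurious transitions in the observed series.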


3. LATENT CLASS MARKOV: MODEL 
SPECIFICATION 


Paul Biemer estimated a separate three-occasion LC 
Markov model for each of the 30 three-month data sets. 
Interview mode was used as a grouping variable in order to 
take into account some of the heterogeneity in the true 
employment distributions and classification errors. The 
reported error rates in the tables are averages over interview 
modes and rotation groups. 


I would have set up the model in a somewhat more 
elegant and less ad hoc manner. Instead of running a 
separate analysis for each of the rotation groups, I would 
have tried to build a simultaneous model for all rotation 
groups. The main problem of doing a series of separate 
analyses is that parameters that should actually be equated 
across rotation groups are now estimated without 
constraints. For example, the employment distribution in 
March 1994 should be the same in the rotation groups that 
were interviewed between January and March, February and 
April, and March and May, respectively. Moreover, the 
transition probabilities between March and April should be 
the same in the February—April and March—May rotation 
groups. This also has implications for the Parallel Survey 
groups: their time-specific latent distributions and 
transitions should be assumed to be equal to the ones of the 
standard CPS. That would have been a much better manner 
to test whether measurement errors differ between the two 
questionnaires. Especially for the period in which the 
questionnaire forms overlap, it is crucial to assume equal 
latent distributions in order to prevent 
differences in measurement error from appearing partially as 
differences in true states. 

A similar problem of the separate analyses applies to the 
estimation of the classification errors. These are assumed to 
be time-constant within the 3-month period that a rotation 
group is interviewed, but are allowed to differ across 
rotation groups, even if they are interviewed in the same 
month. It would, of course, be much better to impose 
equality constraints across rotation groups. A consistent 
application of the time-homogeneity assumption would 
imply that, both for the old and the new questionnaire form, 
the measurement errors are constant within the full 
investigation period. 

What we actually need is an LC Markov model covering 
all 30 months; that is, a model for 30 instead of 3 time 
points. Such a simultaneous model for all rotation groups is 
as easily specified as a model for 3 time points. Of course, 
for each rotation group, only 3 of the 30 months are 
observed, which means that the other time points have to be 
treated as missing values. This is not a problem in the 
maximum likelihood estimation of the model parameters 
since we can simply assume that the data are missing at 
random (Vermunt 1997). Questionnaire type (old/new) 
serves as a grouping variable (in addition to interview mode) 
and affects the time-homogeneous classification error 
probabilities. In other words, we estimate only two sets of 
classification errors, one for the old and one for the new 
questionnaire. Transition probabilities may change over 
time, but will be equal across rotation groups interviewed at 
the same occasions. Moreover, the initial state probabilities 
of a rotation group are not estimated as separate parameters 
since they are defined by the current state of the latent 
Markov chain. 
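To make the treatment of unobserved months concrete, the following is a minimal sketch of the standard forward recursion for a hidden Markov model in which a missing month simply contributes no response probability. It is an illustration only: it assumes a single time-homogeneous transition matrix for brevity (the model described above lets transitions vary over time), all numerical values are made up, and it does not reproduce any particular program's implementation.

```python
import math


def forward_loglik(observations, init, trans, emit):
    """Log-likelihood of one response sequence under a hidden Markov model.

    observations: list of observed category indexes, or None for a missing month.
    init[i]:      probability of starting in latent state i.
    trans[i][j]:  probability of moving from latent state i to j.
    emit[j][y]:   probability of observing category y in latent state j.
    """
    n_states = len(init)
    alpha = list(init)
    loglik = 0.0
    for t, y in enumerate(observations):
        if t > 0:
            # Propagate the latent distribution one month forward.
            alpha = [sum(alpha[i] * trans[i][j] for i in range(n_states))
                     for j in range(n_states)]
        if y is not None:
            # Only observed months contribute a response probability.
            alpha = [alpha[j] * emit[j][y] for j in range(n_states)]
        norm = sum(alpha)
        loglik += math.log(norm)
        alpha = [a / norm for a in alpha]
    return loglik


# Made-up two-state example: the respondent is observed only in the second month.
init = [0.6, 0.4]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.95, 0.05], [0.10, 0.90]]
print(math.exp(forward_loglik([None, 0, None], init, trans, emit)))
```

Because a missing month leaves the latent distribution normalized, a sequence with no observed months has likelihood one and therefore adds nothing to the total log-likelihood.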

A practical problem of the simultaneous modeling is that 
with so many time points it is no longer possible to estimate 
the model parameters with the standard EM algorithm. With 
a variant of EM called the Baum-Welch algorithm, 
however, the model can also be applied with many time 
points (Vermunt 2003; Paas, Bijmolt and Vermunt 2003). 
This algorithm is implemented in an experimental version of 
the Latent GOLD program (Vermunt and Magidson 2000, 
2003) and will be available in a future version of this 
program. 

An alternative way to implement a simultaneous model is 
as an LC Markov model for 3 occasions in which rotation 
group serves as a grouping variable and in which the relevant 
equality restrictions across rotation groups are imposed on 
the classification errors, transition probabilities, and initial 
state probabilities. The most complicated part of this 
approach is that it requires the use of restrictions on 
marginal probabilities (Vermunt, Rodrigo and Ato-Garcia 
2001). More precisely, the initial state probabilities should 
be in agreement with the marginal class sizes in the rotation 
groups that are interviewed at the same occasion. 

Other aspects of the modeling that could be refined are 
the treatment of missing values and the coding of the 
interview mode. It is not necessary to eliminate cases with 
missing values from the analysis, as is done by Paul Biemer, 
because ML estimation with missing values is 
straightforward. As far as the interview mode is concerned, it would 
be much more elegant to work with only two categories 
(proxy and self) instead of four categories and let the 
interview mode vary across occasions within cases. In other 
words, interview mode could be used as a time-varying 
covariate. Vermunt, Langeheine and Böckenholt (1999) 
proposed such a latent class Markov model with 
time-varying covariates. 


4. MODEL FOR RESPONSE PROCESS 


It is a very nice idea to try to disentangle which questions 
in the questionnaire are causing the classification errors by 
modeling the response process itself. This may yield lots of 
valuable information for redesigning the questionnaire. I, 
however, think that the extended models for the employment 
statuses “on layoff” and “looking for work” are 
formulated in an overly complicated manner. 

The form of the created variable R is the same as of the 
outcome variable in a sequential choice analysis or in a 
discrete-time survival analysis. Answering the next question 
is fully determined by whether the current one is answered 
positively or not. The information we have is how many 
steps a person takes, which is conceptually equivalent to a 
discrete survival time. A person “surviving” till the end is 
classified as being “on layoff” (“looking for work”). 
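The survival-analysis reading of R can be sketched as follows: for each step, the hazard is the proportion of respondents who give a negative answer among those who reached that step. The counts below are invented for illustration and the function name `step_hazards` is hypothetical.

```python
def step_hazards(counts):
    """Discrete-time hazards from the distribution of R.

    counts[r] is the number of respondents with R = r, for r = 0, ..., K.
    Returns a list whose element k is the proportion answering question
    k + 1 negatively among those who reached it.
    """
    at_risk = sum(counts)
    hazards = []
    for r in range(len(counts) - 1):
        hazards.append(counts[r] / at_risk if at_risk else float("nan"))
        at_risk -= counts[r]   # respondents stopping here leave the risk set
    return hazards


# Invented counts for K = 4: the last entry holds those "surviving" all steps.
counts = [50, 30, 10, 10, 100]
hazards = step_hazards(counts)

survival = 1.0
for h in hazards:
    survival *= 1.0 - h
print([round(h, 3) for h in hazards])
print(round(survival, 3))  # share classified as "on layoff": 100 / 200 = 0.5
```

The product of the complements of the hazards recovers the proportion who "survive" all steps, which is the proportion classified into the target category.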

In my opinion, it is not very helpful to treat this variable 
as being generated by K latent variables (the $T_k$). This only 
makes sense if theoretically there should be a response 
hierarchy at the latent level, which, however, because of 
measurement error, is not encountered at the manifest level. 
That is, if at the manifest level there are $2^K$ instead of K 
possible responses. Even if this is the case, it often suffices to 
conceptualize the model as a model with a latent variable 
with K + 1 classes and K indicators, a structure that is 
sometimes referred to as a probabilistic Guttman model. 

Paul Biemer recognizes the complexity of the K latent 
and K manifest variables formulation and decides to 
simplify the model. However, I assume because of his 
starting point, he decided to keep K + 1 latent classes. I do 
not see why so many latent classes are needed. There are not 
even so many employment states. More logical would be to 
have only two classes, “on layoff” and “not on layoff” 
(“looking for work” and “not looking for work”), since the 
questions are only intended to make this particular 
distinction. It can, of course, happen that the questions turn 
out to be informative about the type of “not on layoff” (“not 
looking for work”) status, in which case an extra latent class 
might be needed. What is clear to me is that K + 1 classes 
are far too many. 

I was wondering how many persons were classified as
“on layoff” (“looking for work”) at the various time points
in the analysis with composite variable R as indicator. Are
these numbers, as well as the number of transitions into and
out of this state, similar to the ones obtained with the
standard four-state LC Markov model? In my opinion, this is
a requisite for the validity of the calculation performed to 
obtain the figures presented in Tables 3 and 4. 

A final thing that occurred to me is the following. Why
not build an LC Markov model using the full
questionnaire information, as is done in the second part of
the analysis? In other words, an alternative to using the
observed constructed classification consisting of 4 employ- 
ment categories would be to use the full set of CPS 
employment questions answered by the respondents. Such 
an analysis with multiple indicators would not only be much 
more informative, it would also make it possible to test and 
relax some of the assumptions that were made in the current 
analysis. For example, the ICE assumption could be relaxed 
for some of the questionnaire items. 


REFERENCES 


BASSI, F., HAGENAARS, J.A., CROON, M. and VERMUNT, J.K. 
(2000). Estimating true changes when categorical panel data are 
affected by uncorrelated and correlated classification errors. 
Sociological Methods and Research, 29, 230-268. 




PAAS, L.J., BIJMOLT, T.H. and VERMUNT, J.K. (2003). 
Extending dynamic Segmentation with Lead Generation: A Latent 
Class Markov Approach. Center Paper, Tilburg University 
(submitted for publication). 


VERMUNT, J.K. (1997). Log-linear models for event histories.
Techniques in the Social Sciences Series, Thousand Oaks: Sage
Publications. 8.


VERMUNT, J.K. (2003). Multilevel latent class models. Sociological 
Methodology, 33. In press. 


VERMUNT, J.K., LANGEHEINE, R. and BÖCKENHOLT, U.
(1999). Latent Markov models with time-constant and 
time-varying covariates. Journal of Educational and Behavioral 
Statistics, 24, 178-205. 


VERMUNT, J.K., and MAGIDSON, J. (2000). Latent GOLD User's 
Manual. Boston: Statistical Innovations Inc. 


VERMUNT, J.K., and MAGIDSON, J. (2003). Addendum to Latent 
GOLD User’s Guide: Upgrade for Version 3.0. Boston: Statistical 
Innovations Inc. 


VERMUNT, J.K., RODRIGO, M.F. and ATO-GARCIA, M. (2001).
Modeling joint and marginal distributions in the analysis of 
categorical panel data. Sociological Methods and Research, 30, 
170-196. 


Survey Methodology, December 2004 




Comment 


STEPHEN M. MILLER and ANNE E. POLIVKA


1. INTRODUCTION 


We are grateful for the opportunity to comment on this 
interesting paper. We will focus most of our comments on 
the empirical findings about the 1994 Current Population 
Survey (CPS) redesign, rather than a technical discussion of 
the Markov Latent Class Analysis (MLCA) methodology 
itself. 

In his article, “An Analysis of Classification Error for the 
Revised Current Population Survey Employment 
Questions,” the author applies MLCA models in an effort to 
trace the source of what he believes to be the “reduced 
accuracy of the revised classification of unemployed 
persons” after the redesign. In the CPS, individuals are
considered to be unemployed either because they are 
classified as being on layoff or because they are classified as 
looking for work. The author reports a particularly large 
reduction in the accuracy of the measurement of persons on 
layoff. Consequently, we will focus our attention on the 
classification of individuals on layoff, although similar 
comments can be made about the change in the measure- 
ment of those looking for work. In examining the accuracy 
of the measurement of those on layoff, the author assumes 
that those classified as on layoff were conceptually the same 
before and after the 1994 redesign, and that these 
individuals should exhibit the same labor force flows 
month-to-month. There are, however, many reasons why the 
improved measurement embodied in the redesign should 
conceptually change who is classified as on layoff. In 
addition, there are several factors unrelated to changes in 
question wording that could affect the composition of those 
classified as on layoff. Therefore, what the author describes 
as a reduction in accuracy due to the redesign more 
appropriately could be attributed to conceptual changes in 
those classified as on layoff, and the fact that what was 
being measured by the CPS before the redesign is not the 
same as what is being measured by the CPS after the 
redesign. 


2. IMPROVED MEASUREMENT 


One of the main reasons for the CPS redesign was to 
more accurately measure official definitions and concepts. 


Layoff was found to be an especially problematic concept, 
in that its meaning in general usage in the 1990’s — a 
permanent job separation — was very different from the 
official CPS definition — a temporary job separation with the 
expectation of recall. When the questions were originally 
written in the 1940’s, the term layoff was commonly used to 
refer to temporary spells of unemployment due to retooling 
or slowing of business conditions. Consequently, recall 
expectations were not asked about in the pre-redesign 
questionnaire. Research conducted in the 1980s and early 
1990s in preparation for the redesign indicated that 
respondents’ interpretation of layoff had become consider- 
ably broader than the official definition. Focus group 
interviews and large scale respondent debriefings found that 
between 30 and 50 percent of those who said they were on 
layoff did not expect to return to their former employers 
(Rothgeb 1982; Palmisano 1989; Polivka and Rothgeb 
1993). Also, in 1993, 5.4 percent of those classified as on
layoff had last worked 1 to 5 years ago, and another 0.6 
percent had not worked in the last 5 years. This lack of 
recent work experience further supports the notion that 
many of those classified as on layoff prior to the redesign 
had no expectation of recall. 

To better measure the official CPS definition of layoff, 
two questions were added in the revised questionnaire 
asking about individuals’ recall expectations — “Has your 
employer given you a date to return to work?” and “Have 
you been given any indication that you will be recalled to 
work within the next 6 months?” Individuals for whom the 
answer is “yes” to either of these questions are classified as 
on layoff if they are available for work; all others are 
excluded from being classified as on layoff (these indivi- 
duals can be classified as unemployed later in the question- 
naire if they meet the active job search and availability 
criteria). 

As a result of the addition of these direct questions, a 
somewhat different group of people would be expected to 
be classified as on layoff. Prior to the redesign, a substantial 
proportion, if not the majority, of individuals classified as on 
layoff were in fact permanently separated from their 
employers. After the redesign, those classified as on layoff 
had to expect to be recalled to their former employers; thus 
the vast majority of these individuals should be only 
temporarily separated from their employers. It is not 
surprising that these two groups of individuals would
exhibit different month-to-month flows between labor force 
groups. It is reasonable to expect that individuals who 
expect to be recalled to their job would be more likely than 
those who are permanently separated to go from being 
temporarily on layoff to employed in consecutive months. 
Further, compared with permanently separated workers, 
those in industries in which temporary layoffs are prevalent 
would be more likely to be on layoff one month, employed 
the next month, and then laid off again. 

Month-to-month gross flows of individuals between 
labor force states indicate that there was an increase in the 
proportion of the unemployed who went to employment 
after the 1994 redesign. Specifically, in 1994, 26.6 percent 
of those who were unemployed in the first month were 
employed in the second month, compared with 23.7 percent 
in 1993. 

The author’s MLCA estimate of a supposed decrease in
the accuracy with which those on layoff are classified after the
redesign, which arises because more individuals are classified as
employed subsequent to being on layoff, is in reality exactly
in accord with what would be expected with a tightening of
the definition of on layoff, and is consistent with the
increase in the month-to-month gross flows between
unemployment and employment (although the increased flow
also is in accord with the declining unemployment rate that
was observed during the time period covered by the author’s
study). The MLCA’s smaller, but still significant, estimated
decrease in accuracy due to more individuals on layoff 
being classified as not in the labor force after the redesign 
also is consistent with the tightening of the definition of on 
layoff through the requirement that individuals expect to be 
recalled in the next six months, given that individuals may 
adapt or change their recall expectations over time. For 
instance, when first interviewed, individuals may expect to 
be recalled in the next six months. However, in subsequent 
months, as the time from the initial separation increases, 
these individuals may no longer say that they expect to be 
recalled. If, at the same time, these individuals have not 
started searching for alternative employment, perhaps 
because they are still eligible to receive unemployment 
insurance payments, these individuals would transition to 
being not in the labor force. Alternatively, individuals may 
initially expect to be recalled; however, in subsequent
months due either to poor weather conditions or a deterio- 
rating economic situation for their former employers these 
individuals may become more uncertain about the proba- 
bility of being recalled and thus they may not say that they 
expect to be recalled. If in later months, economic condi- 
tions for their former employers improve or the weather 
becomes less inclement, these individuals again may 
correctly feel that they will be recalled. The existence of
changing expectations could generate a three-month pattern
where individuals truly were on layoff in the first month, not 
in the labor force the second month, and on layoff again in 
the third month. Those who were permanently separated 
from a job and were incorrectly classified as on layoff in the 
unredesigned survey would be unaffected by changing 
recall expectations. Consequently, individuals who were 
permanently separated from their jobs probably would be 
more likely to report themselves as on layoff in consecutive 
months with the unredesigned survey. The MLCA model 
would interpret this greater stability as indicating that those 
on layoff were more accurately measured prior to the 
redesign. However, this greater “accuracy” would only be
amongst those who were incorrectly classified because they 
used too broad a definition. 

The author concludes that 60 percent of the misclassi- 
fication of those on layoff in the redesigned survey is due to 
the question “LAST WEEK, did you do ANY work for 
pay?” This actually is consistent with more people being on 
temporary layoff and being recalled by their former 
employers in the redesigned survey (although if individuals 
on layoff engage in temporary employment while waiting to 
be recalled to their former employers, an increase in transi- 
tions to employment after 1994 may also be at least partially 
attributable to the broader employment question used in the 
redesigned survey). Similarly, the author concludes that 40 
percent of the misclassification of those on layoff in the 
redesigned survey is due to the expectation of recall 
questions (“Has your employer given you a date to return to 
work?” and “Have you been given any indication that you 
will be recalled to work within the next 6 months?”). This is
consistent with changing recall expectations and a slight 
increase in the flow between on layoff and not in the labor 
force. The author is obtaining different MLCA estimates for
those classified as on layoff before and after the redesign
because the composition of those groups has changed,
and it has changed in a
manner that was desired and intended by those who
redesigned the questionnaire.

Further evidence of the different composition of those 
classified as on layoff can be found in a comparison of data 
that were collected to determine the effect of the redesign on 
labor force estimates generated from the CPS. Prior to 
January 1994, the redesigned questionnaire was admi- 
nistered to 12,000 households monthly from late 1992 to 
December 1993. After the new questionnaire was imple- 
mented in 1994, the old questionnaire was administered 
monthly from January 1994 to May 1994 to 12,000 house- 
holds drawn from the same sample. The experimental 
administration of the old and redesigned questionnaires has 
been referred to as the “Parallel Survey”. Parallel Survey 
estimates from before 1994 using the new methodology and 
after 1994 using the old methodology were generated to
compare to official CPS estimates using the unredesigned 
CPS procedures prior to 1994 and the redesigned proce- 
dures after 1994. Polivka and Miller (1998) illustrate the 
importance of using both parts of the Parallel Survey to 
obtain a complete picture of the effects of the redesigned 
survey. For instance, if just the first part of the Parallel 
Survey were used, it would have been estimated that the 
redesign increased the unemployment rate by 0.5 percentage 
point. In fact, when both parts of the Parallel Survey were 
used, the redesign was estimated to have no statistically 
significant effect on the unemployment rate. 

Using both parts of the Parallel Survey and the official 
CPS estimates, Polivka and Miller estimate that the 
redesigned CPS decreased the proportion of unemployed 
men who were on layoff by a little less than 7 percent, while 
it increased the proportion of unemployed women classified 
as on layoff by almost 7 percent (although the latter estimate 
was not statistically significant at a 5 percent level). These 
estimates imply that the redesign would decrease the 
proportion of those on layoff who were male and increase 
the proportion who were female compared to the pro- 
portions that were obtained prior to the redesign, if all else 
were equal. Comparison of annual averages for those over
the age of 20 supports this notion, since they indicate that, in
1993, 67.2 percent of those on layoff were male, compared 
to 63.6 percent of those on layoff in 1994 (although in 
addition to questionnaire changes these proportions could be 
affected by changes in economic conditions). 

The industry distribution of those classified as on layoff, 
using data from both parts of the Parallel Survey and the 
official CPS, reveals other compositional changes in those 
classified as on layoff before and after the redesign. 
Comparison of estimates from the redesigned survey with the
official CPS estimates for January to May 1993 and of estimates from
the unredesigned survey with official CPS estimates for
January to May 1994 reveals particularly dramatic differences
for those in the durable manufacturing industry. The
proportion of those on layoff who were formerly employed 
in durable manufacturing when the unredesigned questions 
were used was almost half the proportion obtained when the 
redesigned questions were used (for January to May 1993 
the proportion of those on layoff who were formerly 
employed in durable manufacturing averaged 16.8 percent 
among those who received the unredesigned questions and 
9.8 percent among those who received the redesigned 
questions. For January to May 1994 the proportions were 
8.7 percent among those who received the unredesigned 
questions and 15.5 percent for those who received the 
redesigned questions). At the same time the proportion of 
those on layoff who were in construction was 10 to 15 
percent larger when the redesigned questions were used 
compared to when the unredesigned questions were used
(for January to May 1993 the proportion of those on layoff 
who were formerly employed in the construction industry 
averaged 33.3 percent for those who received the redesigned 
questions and 27.4 percent for those who received the 
unredesigned questions. For January to May 1994 the pro- 
portions were 33.3 percent and 25.9 percent respectively). 
Averaging the average difference between the first part 
of the Parallel Survey and the CPS for January 1993 to May 
1993 (which is equal to the new method effect plus the 
Parallel Survey effect) with the average difference between 
the CPS and the second part of the Parallel Survey for 
January 1994 to May 1994 (which is equal to the new 
method effect minus the Parallel Survey effect) indicates 
that the redesign decreased the proportion of those classified 
as on layoff who were formerly employed in the durable 
manufacturing industry by 7.3 percentage points and 
increased the proportion classified as formerly employed in 
the construction industry by 3.7 percentage points 
(averaging the average difference between the first part of 
the Parallel Survey and the CPS with the average difference 
between the CPS and the second part of the Parallel Survey
is in the spirit, albeit a simplified version, of the main-effects
linear model estimates using generalized least
squares that were presented in Polivka and Miller).
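
The logic of this averaging can be sketched in a few lines of Python. The sketch assumes the simple additive structure described above, in which the first difference equals the new method effect plus the Parallel Survey effect and the second equals the new method effect minus the Parallel Survey effect. The function name and all input values are hypothetical and are chosen only so that the arithmetic is easy to follow; they are not estimates from the Parallel Survey or the CPS.

```python
def decompose_effects(parallel_new_1993, cps_old_1993, cps_new_1994, parallel_old_1994):
    """Separate the new method effect from the Parallel Survey effect.

    All arguments are proportions (in percent) for one category,
    averaged over the same calendar months in each year.
    """
    # First part: redesigned Parallel Survey minus official (unredesigned) CPS.
    # Under the assumed additive structure: method effect + Parallel Survey effect.
    d1 = parallel_new_1993 - cps_old_1993
    # Second part: official (redesigned) CPS minus unredesigned Parallel Survey.
    # Under the assumed additive structure: method effect - Parallel Survey effect.
    d2 = cps_new_1994 - parallel_old_1994
    method_effect = (d1 + d2) / 2    # Parallel Survey effect cancels
    parallel_effect = (d1 - d2) / 2  # method effect cancels
    return method_effect, parallel_effect


# Hypothetical inputs, for illustration only.
method, parallel = decompose_effects(
    parallel_new_1993=18.5, cps_old_1993=20.0,
    cps_new_1994=17.0, parallel_old_1994=19.5,
)
print(method, parallel)
```

Because the Parallel Survey effect enters the two differences with opposite signs, it cancels in the average, which is why both parts of the Parallel Survey are needed to isolate the effect of the new method.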
Individuals in different industries could have very 
different true labor force transition patterns which in turn 
could be influencing the MLCA estimates. For instance, 
given that a substantial proportion of employment in the 
construction industry is sensitive to weather conditions and 
may be more project-oriented than other types of 
employment, it is not unreasonable to expect that workers in 
construction might truly be more likely to be temporarily 
laid off in the first of three consecutive months, employed 
on a short term basis in the second month (either because 
the weather improved in the second month or because a 
short term construction project was undertaken), and then 
temporarily laid off again in the third month (either because 
weather conditions deteriorated or the project for which they 
were hired was completed). On the other hand, employment 
in the durable manufacturing industry has been steadily 
declining since the 1970’s (for example, comparing non- 
recession years, it was estimated that in 1971 14.9 percent of 
U.S. workers as measured by BLS’s establishment survey 
were employed in the durable manufacturing industry, 
compared to 9.2 percent in 1993 and 8.5 percent in 2000). 
This long term decline in employment makes it likely that a 
large proportion of workers in the manufacturing industry 
classified as “on layoff” prior to the redesign were permanently
separated from their employers (the change in the
industry distribution when the expectation of being recalled 
was imposed is consistent with this notion). Being
permanently separated from a job in combination with the
relatively high wages workers in durable manufacturing 
received may increase the likelihood of these individuals 
being unemployed in three consecutive months, because it 
takes time to find employment in another industry at a 
similar wage. 

Comparison of MLCA model estimates before and after 
the redesign without accounting for differences in industry 
composition of those classified as on layoff could cause 
analysts to mistakenly conclude that the redesign decreased 
the accuracy of labor force classifications. In reality, the 
increase in transitions that were measured after the redesign 
represented a true increase in transitions to employment 
after layoff was properly asked about in the CPS question- 
naire. Failure to account for the fact that the redesigned CPS 
questionnaire intentionally classified a somewhat different 
group of individuals on layoff than did the unredesigned 
questionnaire could lead to incorrect conclusions being 
drawn from the MLCA models. Workers permanently 
separated from their employers who were classified as on 
layoff using the unredesigned questions are appearing to be 
more accurately classified in MLCA models, but they are 
more stable in a classification that was incorrect in the first 
place. Further, a proportion of individuals who are correctly 
classified as on layoff according to the official definition 
inherently could have less stable employment histories due 
either to their personal tastes or the industries with which 
they are associated. 

In addition to compositional changes related to differ- 
ences in question wording, the author also may have inad- 
vertently captured in his estimates several other composi- 
tional changes unrelated to wording differences. These 
include differences in the time periods the author used for 
his estimates, as well as technological changes in the data 
collection process and economic conditions. 


3. SEASONALITY 


The first inadvertent compositional difference the author 
may have introduced is related to seasonality and the 
different time frames the author used for estimation. The 
number of individuals classified as on layoff in the CPS has 
a great deal of seasonal variability, with typically a larger 
number of individuals being on layoff early in the year. For 
instance, there were 358 individuals who were classified as 
on layoff in January 1995 who matched to February and 
March, while there were 294 individuals classified as on 
layoff in March 1995 who matched to April and May, and 
only 188 people classified as on layoff in June 1995 who 
matched to July and August. This means that there were 18
percent fewer people initially classified as on layoff in
March 1995 than in January 1995, and 47 percent fewer
individuals initially classified as on layoff in June 1995
than in January 1995. Using three-month moving averages
generated with the same calendar months probably would 
help to mitigate the effects of seasonality. However, the 
author did not use the same monthly time spans to generate 
his three-month moving averages to estimate the MLCA 
models before and after the redesign. The majority of the 
author’s pre-redesign estimates were generated using data 
from August 1992 through December 1993, while the 
majority of his post-redesign estimates were generated using 
data from January 1994 to May 1995. Using these time 
spans means that the author only has, for instance, one 
January to March matched set of data for the pre-redesign 
estimates, while he has two January to March matched sets 
of data for the post-redesign estimates. 
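
The size of the seasonal swing depends on which month is taken as the base. The short Python sketch below uses the three matched counts quoted above (358, 294 and 188) and computes the difference both ways; the function names are hypothetical and introduced only for this illustration.

```python
def pct_fewer(january, later):
    """Shortfall of the later month, as a percentage of the January count."""
    return 100 * (january - later) / january


def pct_more(january, later):
    """Excess of January, as a percentage of the later month's count."""
    return 100 * (january - later) / later


# Matched three-month counts of persons initially on layoff, 1995 (from the text).
matched = {"January": 358, "March": 294, "June": 188}

for month in ("March", "June"):
    fewer = pct_fewer(matched["January"], matched[month])
    more = pct_more(matched["January"], matched[month])
    print(f"{month}: {fewer:.1f} percent fewer than January; "
          f"January is {more:.1f} percent higher")
```

Measured against January, the March and June counts are roughly 18 and 47 percent lower. Measured against the later months, the January count is roughly 22 and 90 percent higher. Either way, the seasonal differences are large relative to the sample sizes involved.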


4. TECHNOLOGICAL CHANGES IN DATA 
COLLECTION 


A second reason that the composition of the groups in 
various labor force states may be different for data collected 
with the unredesigned and the redesigned methodology is 
related to the ability to match individuals’ data from month 
to month and the quality of these matches. The vast majority 
of data collected using the unredesigned methodology either 
in the official CPS prior to January 1994 or in the Parallel
Survey from January 1994 to May 1994 were recorded 
using a paper form, and interviewers were required to 
transcribe by hand household and person identification 
numbers from master files to the paper survey forms. All of 
the data collected using the redesign methodology, either in 
the official CPS after January 1994 or in the Parallel Survey 
in 1993, were collected using an automated instrument that 
was loaded onto either a laptop computer or on a centralized 
computer. As part of the computerized data collection 
process, household and person identification numbers were 
automatically and consistently carried forward month to 
month. Using paper forms and transcribing data by hand has 
the potential to introduce errors and cause researchers to 
eliminate as non-matches individuals who actually are the 
same individuals and thus true matches. 

Using the same public-use data that the author used, in 
combination with additional information about whether an 
individual had moved (that is periodically collected in the 
CPS), Madrian and Lefgren (1999) estimated that, 
depending on the stringency of the match criterion used, 
between 64 and 87 percent of those who were eliminated as 
an invalid match probably legitimately did match. Further, 
Madrian and Lefgren noted that there was a substantial 
decline between 1993 and 1996 in the fraction of invalid 
matches that probably should have been retained in the data 
set based on the criterion of whether an individual had 
moved (since Madrian and Lefgren were using publicly
released data, they were not able to investigate the validity 
of matches for 1994 to 1995 and 1995 to 1996 because the 
ability to match this data was suppressed to protect 
individuals’ confidentiality). Madrian and Lefgren suggest 
that the increased number of valid matches for 1996 onward 
was due to improvements attributable to the redesign (it 
should be noted that, although a better match can be 
obtained using data internal to BLS and the Census Bureau 
in which information has not been suppressed, the quality of 
a match using internal data still will be affected by the data 
collection methodology. Thus the quality of the match will 
be better after the redesign than before the redesign). In their 
research, Madrian and Lefgren also found that individuals 
who were incorrectly excluded from the matched data sets 
were much more likely to be young and have their 
information provided by another member of the household 
(a proxy responder). These individuals are also the ones that 
Biemer argues are more likely to have classification errors 
in their labor force status. Consequently, by potentially 
including more of these individuals in his study due to the 
improved quality of the match, the author could be 
obtaining a decrease in the accuracy of his measures that he 
incorrectly is attributing to the questionnaire. 


5. ECONOMIC CONDITIONS 


Economic conditions may also contribute to differences 
in the composition of the groups classified as on layoff 
before and after the redesign. From 1992 to 1995, the period 
which the author uses for the majority of his MLCA 
modeling, the unemployment rate was steadily declining. 
Specifically, in 1992 the annual average unemployment rate 
was 7.5 percent while in 1995 it was 5.6 percent. 

At a higher unemployment rate, it is likely that the 
proportion of individuals who remain unemployed month to 
month is larger than at lower unemployment rates. As the 
economy improves and the unemployment rate declines, it 
is not unreasonable to expect an increase in the proportion 
of individuals who transition from being on layoff to 
employment. With the increase in these transitions to 
employment, the proportion of individuals who transition to 
temporary jobs might also increase. Indeed, although 
undoubtedly related to many factors, the number of 
individuals employed in the temporary help supply industry 
(as defined under the NAICS coding system) increased 44 
percent between 1992 and 1995 — from 1.1 percent to 1.5 
percent of the U.S. establishments’ payrolls (as measured by 
the BLS’s establishment survey). 

In addition, as the unemployment rate declines, the type 
of individual classified as unemployed may change. 
Specifically, those who remain unemployed when the
unemployment rate is low tend to find it more difficult to 
become steadily employed and are more likely to transition 
quickly between labor force states. This is the logic behind 
studies that analyze the effects of different types of 
employment separations on subsequent labor force 
outcomes. For instance, in a study comparing individuals 
who were separated from their employers due to slack 
business conditions as opposed to complete plant
shutdowns, Gibbons and Katz (1991) found that, with regard to
both duration of joblessness and earnings, workers who 
were separated from their employers due to slack business 
conditions did significantly worse than did those who were 
separated due to a plant closing. Gibbons and Katz argue 
that these differences were due to employers being able to 
dismiss their least productive workers, while retaining their 
more productive workers, when business conditions were 
slack, as opposed to employers having to dismiss both their 
least productive and most productive workers when a plant 
was completely shut down. Similarly, Darby, Haltiwanger 
and Plant (1985) argue that as economic conditions worsen, 
the duration of unemployment increases as a result of a 
change in the composition of those who are unemployed. 
This is because in more adverse economic conditions, the 
proportion of the unemployed who are high-skill workers 
(who also are less used to being unemployed and more 
likely to be able and willing to hold out for a more satis- 
factory job) will increase and the proportion of the 
unemployed who are less skilled and who frequently transi- 
tion between labor force states will decrease. 

It is important to note that the majority of the author’s 
pre-redesign estimates were generated using 1992 and 1993 
data, when the unemployment rate averaged 7.0 percent, 
while the majority of the redesigned estimates were gener- 
ated using data from 1994 and 1995, when the unemploy- 
ment rate averaged 6.0 percent. Changes in general 
economic conditions, and corresponding changes in the 
composition of the unemployed, may be affecting the 
supposed accuracy of the author’s estimates in a way that is 
unrelated to the questionnaire. For instance, between 1992 
and 1995, the proportion of the unemployed who were 
teenagers steadily increased from 14.8 percent to 18.2 
percent, while the overall unemployment rate steadily 
declined from 7.5 percent to 5.6 percent. Similarly, the 
proportion of the unemployed who were Hispanic steadily 
increased from 13.6 percent to 15.4 percent between 1992 
and 1995, though some of this may be due to the increasing 
proportion of Hispanics in the population (which rose from 
8.8 percent to 9.4 percent). Both teenagers and Hispanics 
tend to be lower skilled workers who historically have been 
more likely to become unemployed or withdraw from the 
labor market. It should be noted that, regardless of the 
source, an increase in the proportion of the unemployed
drawn from groups with less stable labor force histories will 
influence the MLCA model estimates of accuracy if the 
change is not accounted for in the modeling. 


6. DIFFERENTIAL VALIDITY OF THE MARKOV 
ASSUMPTIONS 


In addition to differences in the composition of those 
classified as on layoff affecting the estimates generated by 
the MLCA models, differences in the composition of the 
various labor force groups before and after the redesign 
could affect the validity of the underlying assumptions of 
the MLCA models. As the author notes, a key assumption 
when implementing MLCA models is that an individual’s 
transition from the second to third month is independent and 
thus uninfluenced by how the individual was classified in 
the first month. When estimating MLCA models for 
individuals’ labor force states this obviously is untrue, and 
the validity of the assumption will likely differ amongst the 
various labor force categories. For instance, an individual 
who is employed in the first month is much more likely to 
be employed in the third month than is an individual who 
has never worked. More importantly, an individual cannot 
be classified as on layoff in either the redesigned or 
unredesigned questionnaire if he or she has not previously 
worked. In addition, under the official definition of layoff that
was implemented in the redesign, individuals also have to 
expect to be recalled. This leads to a much tighter 
relationship between employers and workers across months 
using the redesigned questionnaire. Given that individuals 
on layoff under the redesign are much more likely to be 
recalled and thus employed than under the unredesigned 
questionnaire, the likelihood of an individual’s labor force 
status in the third month depending on their initial labor 
force status in the first month is much higher. Consequently, 
not only is it likely that the Markov assumptions are often 
violated in labor force studies; it is much more likely that 
the Markov assumptions are violated after the redesign. This 
differential violation of the model’s assumptions could be 
fundamentally influencing the author’s results. 


7. CONCLUSION 


In summary, although the author believes that he 
identified a problem that was introduced into the CPS with 
the 1994 redesign, the supposed increase in misclassification
of those on layoff in reality reflects the greater precision
of the survey questions. Rather than identify a true error,
we believe the author may have failed to recognize that the
composition of the groups identified as on layoff before and
after the redesign was different due to both
intentional changes (such as the definition of on layoff being 
built into the questionnaire or improved quality of matches 
obtained because of computerization of the survey) and to 
uncontrolled changes such as developments in the overall 
economy. Finally, we would like to see further work in this
area which combines the MLCA modeling approach with a
careful consideration of the economic concepts being
measured, the time periods being examined and the
assumptions being made. We believe this could lead to a
more accurate understanding of the effects of the 1994 CPS 
redesign, and more useful application of the MLCA 
modeling approach in general. 
Comment


CLYDE TUCKER ¹


1. INTRODUCTION 


I first would like to congratulate Paul Biemer for offering 
an innovative approach to the study of measurement error in 
surveys. Although he chose to illustrate his approach with 
the employment series in the Current Population Survey 
(CPS), the method can be applied to many surveys. My 
comments largely will be conceptual in nature, but I will 
supplement these comments with examples from the same 
data that Biemer analyzed. 

Using Markov Latent Class Analysis (MLCA), the 
Biemer paper relies on an evaluation of the consistency over 
time of respondents’ answers to the questions in the 
employment series. The increase in inconsistency found in 
the new series as compared to the old one, after controlling 
for self versus proxy reports, may serve as an indicator of 
one type of measurement error in the assignment of labor 
force status. Presumably, this error is the result of the failure 
of the new questions (at least, compared to the old ones) to 
collect the correct information for classifying an individual 
into the right labor force category. Thus, the error can be 
attributed to poor question design. Because the analysis 
indicates that the errors tend to be in one direction more than 
in the other — the misclassification of truly unemployed 
individuals into a different category — some might interpret 
the result to be a bias in the unemployment rate. 

I will argue that not only has bias not been introduced but 
also that the new series, while certainly not perfect, reduces 
error, providing a more accurate picture of the employment 
situation. It does this by taking into account the economic 
realities of today in a way that the old series did not. This is 
accomplished by not only better question wording but also 
by the inclusion of follow-up questions and probes that 
capture more detailed information for determining a 
respondent’s true employment status. The use of follow-up 
questions and probes is facilitated by the introduction of a 
computerized survey instrument. As a result of these 
innovations, I believe that the new employment series 
reduces the amount of specification error that existed with 
the old series. By specification error, I mean the error 
arising from using questions that do not measure what they 
are intended to measure. I also will explain why I do not 
believe that Biemer’s method is appropriate for use in this 
particular case. 


2. RECOGNITION OF THE NEED FOR A NEW 
EMPLOYMENT SERIES 


The last major revision of the CPS prior to 1994 took 
place in 1967. In the ensuing years, the labor market under- 
went a great transformation. The number of women in the 
labor force dramatically increased. The number of part-time 
jobs and multiple job holdings escalated. The relationship 
between the worker and the employer became more 
tenuous. Startling technological developments changed the 
way Americans did work and resulted in the creation of new 
types of jobs requiring new kinds of skills. Perhaps most 
importantly, the economy gradually became more service 
oriented and less manufacturing oriented. 

Just one result of these developments that needed to be 
taken into account in the CPS was the change in the 
accepted meaning of “layoff” as so ably described by Miller
and Polivka (2004), but there were others, as enumerated by 
Bregger and Dippo (1993). Better information was needed 
about discouraged workers (those who have given up 
looking for work), multiple jobholders, marginal workers 
(e.g., unpaid workers in a family business), and job- 
changing patterns. In addition, during the 1970s and 1980s, 
concern mounted about the various types of nonsampling 
errors that could be affecting CPS estimates as well as about 
respondent burden and its detrimental effect on data quality. 

Until the 1980s, the technology to tackle these problems 
was not available. However, as Bregger and Dippo (1993, pages
4-5) note, things began to change:


“...in the early 1980s, the introduction of two
new survey methodologies provided the
means for understanding and reducing
measurement error. These included the 
application of behavioral science methods and 
theory — more commonly referred to as the 
cognitive aspects of survey methodology — 
and computer-assisted interviewing. It is 
through the blending of these two methodo- 
logies that a new collection procedure, which 
focuses on reducing measurement error, was 
made possible.” 


Cognitive methods (including focus groups and in-depth 
interviewing) made it possible to develop questions that 
could accurately measure the more complex economic
behaviors that the times required. Furthermore, these
techniques were able to uncover problems in the existing
labor force series (see Polivka and Rothgeb 1993). The
accurate measurement of the more complex behaviors also
required a more complicated survey instrument, one so
complicated that interviewers, left to their own devices,
would have difficulty navigating it. This is where
computer-assisted interviewing played an important role. With a
computerized survey instrument, interviewers could easily 
navigate through the complex skip patterns necessary to 
obtain answers to questions for measuring the wide variety 
of economic behaviors of interest. 


3. CONSIDERATION OF NONSAMPLING 
ERRORS IN BOTH THE OLD AND NEW CPS 
EMPLOYMENT SERIES 


Let me begin this section by detailing my reasons why 
MLCA is not an effective tool for evaluating the new CPS 
design relative to the old one. MLCA can be a good method 
for detecting measurement error within a constant series of 
questions by looking for inconsistencies in response over 
several administrations to the same respondent. In the case 
of the CPS, the method might be appropriate, given a 
careful examination of a well-chosen set of diagnostics, for 
examining problems in the old employment series and the 
new employment series independently of one another. How- 
ever, let me add a caveat here about examining incon- 
sistencies even within the same employment series. Labor 
force status, in itself, is inherently inconsistent over time. 
While the employed and not-in-the-labor-force (NILF) 
categories are relatively stable, the unemployed category is 
not. Those in that category are trying to get out. Controlling 
for seasonal effects by looking at March-May of either 1993 
or 1994, it turns out that, on average, almost 90% of those in 
the employed and NILF categories did not move from one 
month to the next. On the other hand, over half of those in 
the unemployed category did. Thus, the unemployed are a 
particularly difficult group for MLCA to handle. 
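
The stability comparison above amounts to reading the diagonal of a month-to-month transition table. The following is a minimal sketch with hypothetical counts (chosen only to mimic the pattern described, not taken from the CPS); the function name `stay_rates` is likewise illustrative.

```python
# Hypothetical month-to-month counts: outer key = status in month t,
# inner key = status in month t+1. The numbers are illustrative only.
STATES = ["EMP", "UEM", "NILF"]
counts = {
    "EMP": {"EMP": 900, "UEM": 30, "NILF": 70},
    "UEM": {"EMP": 25, "UEM": 45, "NILF": 30},
    "NILF": {"EMP": 60, "UEM": 30, "NILF": 810},
}

def stay_rates(counts):
    """Return, for each origin state, the share of persons who did not move."""
    rates = {}
    for origin, row in counts.items():
        total = sum(row.values())
        rates[origin] = row[origin] / total
    return rates

rates = stay_rates(counts)
for state in STATES:
    print(f"{state}: {rates[state]:.2f} stayed")
```

With counts like these, the employed and NILF rows are dominated by their diagonal cells while the unemployed row is not, which is the feature that makes observed inconsistency hard to separate from true change.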

As for comparing the two series, the use of MLCA is 
problematic because the two series were designed to 
measure different things. There were some significant 
changes made in the employment series in the hopes of 
reducing specification error. Although I do not want to 
dwell on the measurement of layoff (Miller and Polivka
have covered this topic well), I do want to use it as a case in
point for explaining why the comparison of the old and new 
instrument is a difficult one to make. Apart from what 
Miller and Polivka have said, I have my own reasons for 
doubting Biemer’s conclusions. 


The changes in the layoff questions were designed to 
reduce the specification error discovered in qualitative 
research on the meaning of “layoff,” as alluded to by Miller 
and Polivka. In the attempt to eliminate specification error, 
two additional questions were added. One asked whether a 
date for recall had been given, and the other inquired about 
the possibility of returning to the job within the next 6 
months. Only those who were given a recall date or 
expected to return to work within the 6-month period were 
classified as truly “on layoff.” 

Clearly, this altered the characteristics of the group 
classified as unemployed as a result of layoff as well as 
those asked the remaining questions in the employment 
series, but I believe there also were more subtle reasons why 
inconsistencies in respondents’ answers could have 
increased and still not have contributed to measurement 
error to the extent argued by Biemer. In the first place, 
respondents had to answer more questions, which would 
have increased the probability that at least one false 
inconsistency would arise from one month to another. This 
might add to measurement error compared to the old series, 
but specification error, considered to be the greater problem, 
still would be reduced. Furthermore, false inconsistencies 
arising from these questions should be minimized for two 
reasons. These questions are much more specific than the 
single layoff question in the old series, and they had been 
well tested (Esposito, Campanelli, Rothgeb and Polivka 
1991). Moreover, given that more specific questions were 
asked, there would be an increased chance that true change 
had taken place in the state of at least one of them in the 
intervening month. Finally, and of greatest interest to me, is 
the fact that these questions attempt to capture information 
on relatively nuanced changes. For instance, a respondent 
may have changed his or her mind about the possibility of 
being recalled in the next 6 months based on little concrete 
information. With the uncertainties in today’s job market, it 
would be difficult to say that the respondent had given the 
wrong answer. 
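
The first argument, that more questions raise the chance of at least one inconsistency, can be illustrated with a simple calculation. The sketch assumes, purely for illustration, that each question has the same small probability of yielding a month-to-month inconsistency and that questions behave independently; the 5% rate is hypothetical.

```python
def prob_any_inconsistency(per_question_rate, n_questions):
    """P(at least one inconsistent answer) under independence across questions."""
    return 1.0 - (1.0 - per_question_rate) ** n_questions

# One layoff question (old series) versus three (new series), hypothetical rate.
one_question = prob_any_inconsistency(0.05, 1)
three_questions = prob_any_inconsistency(0.05, 3)
print(f"one question: {one_question:.4f}, three questions: {three_questions:.4f}")
```

Under these assumptions the chance of at least one inconsistency grows with the number of questions even though no individual question has become less reliable.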

I now want to address Biemer’s concerns about the initial 
question in the new employment series asking about 
whether any work was done last week for “either pay or 
profit.” His results indicate that this question may be 
contributing to the amount of error he finds in both the 
“layoff” and “looking” series. The change in this question
(as well as the addition of a question on the existence of a 
family-owned business or farm) was prompted by the 
concern that the old questions were not stated broadly 
enough, so that marginal workers, especially those working 
for profit at home, were not being classified as working. For 
example, the Parallel Survey showed the percentage of part- 
time workers in the new CPS was 1.098 times larger than in 
the old CPS, and, coincidentally, the employment to 


Survey Methodology, December 2004 


population ratio for women 65 and older also increased by 
about the same amount (Polivka and Miller 1998). The 
same is true when comparing 1993 to 1994. It stands to 
reason that the increased precision in the identification of 
these marginal workers, who are more likely to be 
inconsistent in their answers from month to month than 
other workers, might be mistaken for measurement error. 
The fact is the more narrow “what were you doing last 
week” question could lead these respondents to consistently, 
but inaccurately, report they were unemployed. 

Finally, let me turn to the other section of the
employment series in which Biemer found a problem — the 
“looking for work” questions. One important change in this 
series involved clarifying the differences in “active” and 
“passive” job search in order to reduce misclassification 
rates in these categories. Studies conducted in the 1980s 
found that interviewers were confused about what 
constituted an active (versus a passive) job search (Polivka 
and Rothgeb 1993). In the redesigned questionnaire, 
interviewers were given an explicit list of both active and 
passive job search methods. 

Comparisons of the results of the old and new questions 
are complicated by the fact that different subpopulations 
were asked these questions in the two series. Those finally 
defined as looking (and, thus, considered unemployed) in 
the two different employment series could have arrived 
there in quite different ways. Half of those considered 
looking in 1993 received that designation by volunteering 
they were looking in the first question (“What were you 
doing most of last week?”); none of those who were looking
in 1994 followed that path. Those retired and 50 or older in 
1994 never got the chance to say they were looking. In 
1993, none of those who said they were on layoff were 
asked the looking question, so they had no chance to be 
classified as NILF in a given month. Then there were the 
two different levels of information given to the interviewers 
for coding active and passive methods. One difference 
uncovered in an analysis of the two groups from 1993 and 
1994 was that a higher proportion of those looking in 1994 
were women compared to 1993 (45.4% vs. 41.2%). 
Referring to the above discussion on the first employment 
question, increases in the inconsistency in reports to the 
looking questions could be the result of capturing more 
marginal workers using the revised employment series. 
Sometimes these individuals would be looking and 
sometimes not. 
4. CONCLUSIONS


Paul Biemer has made a bold attempt to investigate the 
error structure in the CPS employment series; however, his 
findings do not take into account the reasons for the revised 
questions. Taking these into account would help explain the 
month-to-month inconsistencies that he found. Not only 
might these inconsistencies be real, but they could provide 
evidence of a reduction in specification error. For instance,
controls other than for self/proxy could be included in the
model to take into account some of the changes in
methodology, and measurement error could be examined within
more limited subpopulations. More exploration of the utility of MLCA
with inherently inconsistent classifications also should be 
undertaken. 
Response from the Author


PAUL P. BIEMER ¹


1. INTRODUCTION 


My sincere thanks to all four discussants for their 
thoughtful, thorough and constructive comments. They have 
added considerably to our understanding of the complex 
issues surrounding Markov Latent Class Analysis (MLCA) 
and the Current Population Survey (CPS) labor statistics. 
All four discussants raise a number of important issues that I 
will try to address to the extent I can. Some issues will 
require more work and deserve much greater consideration 
than is possible here. More complete responses to those 
issues will have to await the results of future research. 

Considering all the comments collectively, there seems to 
be agreement that Markov latent class analysis has 
considerable potential as a tool for evaluating and exploring 
the sources of measurement error in the CPS. However, 
there is some skepticism that it has identified real problems 
in the CPS questionnaire. Dr. Vermunt, who is also the
author of the software I used for this analysis (viz., ℓEM),
provides a number of valuable suggestions for improving 
the models and investigating the validity of the model 
assumptions. The three other reviewers (Drs. Miller, 
Polivka, and Tucker) are quite familiar with the CPS since 
they are employed by the federal agency that sponsors the 
survey where they played important roles in the 1994 
redesign. Their comments demonstrate the various ways in
which the MLCA model assumptions could be violated for 
these data. In addition, they contain valuable information 
regarding details of the CPS (both pre- and post-redesign) 
and the construction of the CPS labor force variable. The 
comments and suggestions of all the discussants should be 
carefully considered by labor force economists and 
statisticians who are conducting research in the area of
employment measurement error, particularly those using 
MLCA. 


JEROEN VERMUNT’S COMMENTS 


I first address the comments of Dr. Vermunt and then the 
comments of the other three reviewers. I share Dr. 
Vermunt’s concern that the ICE assumption may not hold 
for these data. As he points out, if respondents
misunderstand the labor force questions in the same way
from one month to the next, they may make the same errors 
each month creating correlated errors across the months. As 
an example, a person who is truly in the UEM category at 
both Times 1 and 2 may be more likely to be misclassified
at Time 2 if they were also misclassified at Time 1. This can 
be stated probabilistically as 


$$\rho = \frac{P(B \neq 2 \mid A \neq 2 \text{ and } X = Y = 2)}{P(B \neq 2 \mid A = 2 \text{ and } X = Y = 2)} - 1$$


The numerator probability of the quantity ρ is the
probability that the Time 2 classification (B) is in error
given the Time 1 classification (A) is also in error and the
true classification at both time points is UEM. The
denominator probability is similar except for the condition
that no error is made at Time 1 (i.e., A = 2). Under the ICE
assumption, ρ = 0. Therefore, if ρ > 0 (which is the
likely direction of the correlated error), the ICE assumption
is violated. Dr. Vermunt suggests a simulation study be 
conducted to study the sensitivity of the estimated classi- 
fication errors to violations of this assumption. Of course, 
determining the extent to which the ICE assumption fails for 
the CPS data is not possible via simulation. Nevertheless, it 
is still useful for assessing the potential for correlated error 
to bias the MLCA classification error estimates. 

Following his suggestion, I conducted a small simulation
study to gain some insight as to the consequences of ρ > 0 for
MLCA using CPS data. A sequence of artificial populations
was generated using parameters consistent with those for the
CPS (see, for example, Table 1 in the main paper) except
that ρ was increased in small increments from 0 to its
empirical maximum, i.e., the largest value of ρ that is
feasible without violating the other model assumptions.
Maintaining the other model assumptions in the analysis is 
necessary so that the consequences of violating just the ICE 
assumption can be isolated. 

The largest feasible value of ρ was determined empirically
to be 0.7. At this value of ρ, the MLCA estimate of the
probability of a correct classification of UEM went from 79%
to 85% and the misclassification error rate dropped from
21% to 15%. For mild departures from the ICE assumption,
say 0 < ρ < 0.3, the error rates changed by less than 3
percentage points. These results illustrate that if the ICE 
assumption fails to hold due to positive between-interview
correlations, the error rates estimated by MLCA will be 
somewhat underestimated. However, mild departures from 
the ICE assumption should have little effect on the 
classification error probabilities for these data. A similar 
analysis was conducted for the two other labor force 
categories (i.e., EMP and NLF) but the change in the 
classification error estimates was negligible. This result was 
anticipated due to the relatively small error rates for these 
categories. 
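
To give a feel for what a given departure from ICE means, the sketch below converts a marginal error rate and a correlation parameter into conditional error rates. It is not the simulation described above and fits no MLCA model. It assumes that the parameter is the ratio of the two conditional error probabilities minus one (so that it equals 0 under ICE) and that the error rate is the same at both time points; the function names are illustrative.

```python
def conditional_error_rates(e, rho):
    """Split a marginal Time 2 error rate e into two conditional rates.

    Assumes the Time 1 error rate also equals e and that
    rho = P(error at 2 | error at 1) / P(error at 2 | no error at 1) - 1.
    Returns (q0, q1): the Time 2 error rate given no error, and given an
    error, at Time 1.
    """
    q0 = e / (1.0 + e * rho)
    q1 = (1.0 + rho) * q0
    return q0, q1

def joint_error_probability(e, rho):
    """P(error at both Time 1 and Time 2) for a truly unemployed person."""
    _, q1 = conditional_error_rates(e, rho)
    return e * q1

e = 0.21  # UEM misclassification rate quoted in the text
for rho in (0.0, 0.3, 0.7):
    q0, q1 = conditional_error_rates(e, rho)
    print(f"rho={rho:.1f}: q0={q0:.3f}, q1={q1:.3f}, "
          f"both wrong={joint_error_probability(e, rho):.4f}")
```

As the parameter grows, repeated errors become more common than independence would imply, so erroneous answers look more consistent across months; this is the mechanism by which correlated errors can lead a consistency-based method to understate error rates.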

The results suggest that mild departures from the ICE 
assumption should have little or no effect on the conclusions of
the analysis. Extreme departures might affect the conclusions
in the unlikely event that errors are highly correlated
for the original questionnaire and essentially uncorrelated for
the revised questionnaire. Under that scenario, the original 
questionnaire would appear to have smaller UEM classi- 
fication error than the revised questionnaire. However, there 
is no practical reason to expect this condition to hold since 
both questionnaires present questions that respondents may 
misunderstand consistently across interviews. 

Although these simulation results, as well as those in 
Biemer and Bushery (2001) for investigating the consequences
of violations of the Markov assumption, are quite
useful for studying the sensitivity of the estimates to 
violations of the MLCA model assumptions, they provide 
no direct evidence of the validity of the MLCA estimates. 
Biemer and Bushery (2001) illustrate how the (empirical) 
validity of latent class estimates can be established using 
external data and alternative approaches for estimating 
classification error. A similar analysis based upon test-retest 
reinterview data will be provided in the sequel. 

For the purpose of identifying potential areas where the 
CPS questionnaire can be improved, it is not essential to 
establish unequivocally that the MLCA model assumptions 
hold since model validity is of secondary importance. 
Instead, the primary issue for questionnaire evaluation work 
is whether the method of analysis used is successful at 
identifying questions that have large measurement errors 
and are in need of revision. In other words, the validity of 
the model is established by its ability to find important flaws 
in the questionnaire. Determining whether there truly is 
error in the UEM classification as suggested by MLCA 
requires an evaluation using other methods such as 
cognitive laboratory research. Cognitive interviews could be 
used to investigate encoding, comprehension, recall, and/or 
social desirability issues that generate errors in the responses 
to the UEM questions. If these investigations uncover 
important problems in questions, then the utility of MLCA 
for identifying flawed questions will be supported even 
though the validity of the MLCA modeling assumptions
may never be known. 

Dr. Vermunt’s other suggestions on ways the modeling 
framework could be improved are quite reasonable and I 
hope to investigate them further in the future. However, the 
current software for fitting MLCA models is somewhat 
limited and the estimation of complex models such as those 
he suggests may not be feasible. He also notes that problems 
can arise when fitting large models with the EM algorithm. 
As an example, initially we attempted to use the proxy/self- 
response variable as a time-varying covariate in the MLCA 
models, but encountered problems in the estimation process 
such as “division by 0” errors and persistent convergence to 
local maxima. We ultimately had to abandon the approach 
in favor of the single, time-invariant proxy/self grouping
variable used in the current analysis. As new and more
general software becomes available, the options for MLCA
with time-varying covariates as well as other model
enhancements mentioned by Dr. Vermunt will become feasible.


COMMENTS OF THE BLS DISCUSSANTS 


I will address the comments of Drs. Miller and Polivka 
and those of Dr. Tucker together since the reviewers are 
from the same agency (BLS) and their comments raise 
similar concerns about the analysis. The following five 
points summarize their main concerns:


1. The modifications introduced in the new questionnaire
capture more transitions than the old questionnaire.
MLCA wrongly interprets these as errors when
in fact they are not errors.


2. Respondents may change their minds from month to 
month about whether their employers truly indicated 
that they might be recalled to work. These changes 
should not be classified as a response error. 


3. The Markov assumption does not hold in labor force 
studies and it is violated to an even greater extent 
after the redesign than before the redesign. This 
differential violation of the model’s assumptions 
could be fundamentally influencing the MLCA 
results. 


4. The differences in the estimates of LAYOFF 
classification error before and after the redesign are 
due to the composition of the groups comprising this 
category. This composition changed after the 
redesign in a manner that was desired and intended 
by those who redesigned the questionnaire. 
5. The increased inconsistency in reports to the
LOOKING questions for the revised questions could 
be explained by more marginal workers being 
identified using the revised questions. Sometimes 
these individuals would truly be looking for work 
and sometimes not. MLCA misinterprets these 
ostensibly random changes as response error when 
they are not. 


Point 1 describes an issue that should not pose any difficulties
for MLCA. The MLCA model assumes that each
individual occupies a true labor force state which may 
change from month to month. No assumption is made that 
the transition probabilities are the same for both question- 
naires. The true initial labor force probabilities as well as the 
month-to-month transition probabilities are estimated independently
for each questionnaire. In fact, although not
discussed in the main paper, the model estimates of the true exit
probabilities for LOOKING and LAYOFF are
greater for the revised questionnaire than for the original
questionnaire. Thus, a greater number of flows from one 
labor category to another for the revised questionnaire does 
not necessarily bias the estimates of classification error for 
that category in either direction. 
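
The structure described in this paragraph can be sketched as the standard first-order Markov latent class likelihood for three months. The code is a minimal illustration, not the model fitted in the main paper: it assumes ICE, classification probabilities that do not change over time and, for brevity, a single transition matrix for both month-to-month steps. All parameter values are hypothetical except the 79% correct-classification rate for UEM, which is borrowed from the simulation discussion above.

```python
from itertools import product

def mlca_cell_probability(obs, initial, transition, classification):
    """P(A=a, B=b, C=c) under a first-order Markov latent class model.

    obs: tuple (a, b, c) of observed states in months 1 to 3
    initial[x]: P(true state in month 1 = x)
    transition[x][y]: P(true state next month = y | true state now = x)
    classification[x][a]: P(observed state = a | true state = x)
    """
    a, b, c = obs
    states = range(len(initial))
    total = 0.0
    # Sum over every possible path (x, y, z) of true states.
    for x, y, z in product(states, repeat=3):
        total += (initial[x] * transition[x][y] * transition[y][z]
                  * classification[x][a]
                  * classification[y][b]
                  * classification[z][c])
    return total

# Hypothetical parameters; state order is EMP, UEM, NLF.
initial = [0.62, 0.04, 0.34]
transition = [
    [0.96, 0.01, 0.03],
    [0.20, 0.55, 0.25],
    [0.04, 0.02, 0.94],
]
classification = [
    [0.98, 0.01, 0.01],
    [0.08, 0.79, 0.13],
    [0.02, 0.01, 0.97],
]

cells = {obs: mlca_cell_probability(obs, initial, transition, classification)
         for obs in product(range(3), repeat=3)}
print(f"P(UEM, UEM, UEM observed) = {cells[(1, 1, 1)]:.4f}")
print(f"sum over all 27 cells = {sum(cells.values()):.6f}")
```

Because the transition matrix and the classification matrix are separate parameter sets, a questionnaire that captures more true movement can in principle be represented by larger off-diagonal transition probabilities without any change in the classification error probabilities.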

Point 2 suggests that whether an individual is truly on 
layoff depends upon that individual’s opinion about whether 
he or she was given an indication of possibly being recalled. 
However, this is not how the revised questionnaire defines
the concept. An individual’s true layoff status depends upon 
whether or not the employer truly provided an indication of 
being recalled. Although the respondent’s opinion about 
what the employer indicated may change from month to 
month, the true layoff status does not change according to 
the respondent’s opinion. Flows in and out of the LAYOFF 
category due to the respondent’s opinion should be inter- 
preted as error by the model. 

Points 3, 4, and 5 could be made for any analysis 
employing MLCA. They essentially concern the potential 
bias in the MLCA estimates when month-to-month 
transitions do not behave according to the MLCA model 
and consequently real changes are misinterpreted as 
classification errors. As the reviewers note, there are at least 
three ways this can occur: 


a) the Markov assumption does not hold (point 3), 


b) there is unobserved or unexplained heterogeneity in 
the population (point 4), and 


c) employment-related behaviors for two consecutive 
months are not correlated for some persons; thus, for 
those persons, past month status does not predict the 
current month’s status (point 5 as well as a point made 
by Dr. Vermunt). 


The implications of (a) were considered in a simulation 
analysis in Biemer and Bushery (2001). Their results 
suggest that, for the CPS data, the estimates of classification 
error are quite robust to violations of the Markov 
assumption. It is unlikely, then, that non-Markov transitions
explain the findings of higher classification error for the 
revised questionnaire. Still, additional research is needed to 
more thoroughly understand the implications of non- 
Markov transitions for our results. 

For (b), it is quite possible for MLCA estimates to be 
biased when the compositions of the unemployed popu- 
lations are substantially different under the original and 
revised questionnaires and those differences are not 
explained by the grouping variables used in the model. 
Likewise, (c) may be regarded as a special case of (b). For
(c), the transition probabilities for some population subgroup
are uncorrelated with the prior month’s employment
status; instead they are correlated with other unobserved
variables. In Jeroen Vermunt’s coffee drinker example, the 
unobserved variable is the availability of a specific brand of 
coffee at the market. At this stage of the research, we have 
not conducted simulation studies to quantify the effects of 
unobserved heterogeneity on the estimates, but this 
possibility will be examined in future work. 

However, this issue as well as the general plausibility of 
the MLCA estimates can be investigated to some extent by 
comparing the MLCA estimates with independent estimates 
from an estimation approach that is not affected by (a) 
through (c). If the findings from the alternative analysis are 
consistent with the MLCA findings, the MLCA findings 
gain credibility. As an example, test-retest reliability for the 
CPS employment classifications can be estimated both pre- 
and post-redesign using the CPS reinterview data (see for 
example Biemer and Forsman 1992 for a description of the CPS
reinterview program and these data). The validity of the
estimates of test-retest reliability does not depend upon the 
Markov assumption or group homogeneity assumption; the 
ICE assumption, however, is still relevant for reliability 
estimation. 
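
For readers unfamiliar with the reliability measure used below, the following is a minimal sketch of Cohen's kappa computed from an interview-by-reinterview cross-classification. The counts are hypothetical and the two-category layout (unemployed versus not unemployed) is an assumption made for illustration; it is not the CPS reinterview tabulation.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square table of interview-by-reinterview counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / n
    # Agreement expected by chance from the two sets of marginal shares.
    row_share = [sum(table[i]) / n for i in range(k)]
    col_share = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    expected = sum(row_share[i] * col_share[i] for i in range(k))
    return (observed - expected) / (1.0 - expected)

# Hypothetical counts: rows = original interview, columns = reinterview,
# categories = (unemployed, not unemployed).
table = [[40, 10],
         [10, 40]]
print(f"kappa = {cohens_kappa(table):.2f}")
```

Kappa equals 1 when the two interviews always agree and 0 when agreement is no better than chance, so a decline over time indicates less consistent classification between interview and reinterview.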

Table 1 shows estimates of Cohen’s kappa measure of 
reliability for three time periods: 1992-1993, 1995-1997, 
and 2002-2003. As shown in the table, the reliability of the
CPS classifications of unemployment dropped after the
redesign from about 68% to 65%. The most recent estimates
of kappa indicate reliability has dropped to below 60%.
These results are consistent with the results from the MLCA 
that classification error in the CPS unemployment statistics 
has worsened after the redesign. It is possible that the 
reliability estimates in Table 1 are biased since they also
rely on the validity of the ICE assumption. But as discussed
previously, in order for the results in the table to be explained
by the failure of the ICE assumption, the ICE assumption
would have to hold for the revised questions but not for the
original questions. That condition is very unlikely to occur. 
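
As an illustration of the statistic reported in Table 1, the following Python sketch computes Cohen's kappa from an interview-by-reinterview cross-classification. The function name and the 2 x 2 counts are hypothetical; they are not taken from the CPS reinterview data and only show how observed and chance agreement enter the measure.

```python
def cohens_kappa(table):
    """Cohen's kappa for a square table of counts.

    table[i][j] = number of cases classified i in the original
    interview and j in the reinterview (hypothetical layout).
    """
    k = len(table)
    total = sum(sum(row) for row in table)
    # Observed agreement: share of cases on the diagonal.
    observed = sum(table[i][i] for i in range(k)) / total
    # Chance agreement from the two sets of marginal proportions.
    row_share = [sum(table[i]) / total for i in range(k)]
    col_share = [sum(row[j] for row in table) / total for j in range(k)]
    expected = sum(r * c for r, c in zip(row_share, col_share))
    return (observed - expected) / (1.0 - expected)


# Hypothetical counts: unemployed / not unemployed in both interviews.
example = [[40, 10],
           [10, 40]]
print(f"kappa = {100 * cohens_kappa(example):.1f}%")
```

With these made-up counts the observed agreement is 0.8 and the chance agreement is 0.5, so kappa is the share of the possible improvement over chance that is actually achieved.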


Table 1
Estimates of Cohen’s Kappa for the CPS Before and After the
Redesign

Year           n        Cohen’s κ
1992-1993 ¹    28,063   67.8
1995-1997 ²    22,429   64.6
2002-2003 ³    19,205   58.8

¹ From Biemer and Bushery 2000.
² Bushery and McGovern (1999).
³ Personal communication with Bac Tran at the U.S. Census Bureau.

Given the evidence presented here and in the main paper, 
it seems reasonable to consider the possibility that CPS 
unemployment classification error increased after the 
redesign. The next step is to conduct additional research to 
evaluate these findings and explore the possible causes for 
the error. Rather than the validity of the MLCA
or test-retest reinterview models, the focus of future
research should be the revised CPS questions, particularly
those used in the LAYOFF classification. 

I have already mentioned the possibility of using
cognitive interviews to investigate the problems in the
response process associated with the revised questions. As
an example, one question identified in the MLCA as being 
potentially flawed is: “Have you been given any indication 
that you will be recalled to work within the next 6 months?” 
Some of the issues that could be investigated in the 
cognitive laboratory for this question include: 


— How well do unemployed subjects understand the 
meanings of terms such as “any indication” and 
“recalled?” 


— Do subjects who were recently separated from 
employment have difficulty remembering what their 
employers said about being recalled when they were 
terminated? 


— An employer may say, “If business improves, we 
may call you.” Do respondents answer the question 
correctly in this situation? 


— Do respondents who initially respond that they will 
be recalled later change their responses to this 
question as the months pass by and they have not 
been recalled? 


SPECIFICATION ERROR AND MEASUREMENT 
ERROR 


Finally, I will address an important issue raised by Dr. 
Tucker regarding specification error, measurement error and 
their net effects. As Dr. Tucker explains, the original 
questionnaire suffered from specification error bias caused 
by measuring the wrong concept. The revisions to the labor
force questions introduced in 1994 were designed to 
eliminate the specification error bias by refining the 
concepts of employment and unemployment and modifying 
the survey questions to reflect these refinements. These 
modifications, while reducing specification error, added 
more complexity to the survey questions which could have 
increased the measurement error bias in the labor force 
estimates. Dr. Tucker suggests that while this may be the 
case, the measurement bias in the new employment series 
may be less than the combination of specification bias and 
measurement bias in the old series. To determine whether 
this could be true, the specification error bias ($B_S$) and
measurement error bias ($B_M$) were separately estimated
using the MLCA estimates provided in the paper as 
described below. 

Let $p$ denote the CPS estimate of UEM and let $P$ denote
the expectation of $p$ with respect to the sampling and
measurement error distributions. Let $\pi$ denote the true value
of the characteristic under the definitions of UEM implied
by the specific questionnaire (i.e., without regard to possible
specification error). Therefore, $\pi = P - B_M$, i.e., the value
of $P$ in the absence of measurement error bias.

As noted above, specification error bias is the bias in P 
due to a wrong concept or definition of unemployment 
implied by the questions and/or labor force classification 
process. For the revised questionnaire design, we assume 
that the specification error in p is 0 since it will be regarded 
as the gold standard for estimating the specification error 
bias in the original questionnaire. 

Let $\pi_{old}$ and $\pi_{new}$ denote the $\pi$-parameter for the
original and revised questionnaires, respectively. Then the
specification error bias in the pre-1994 estimates of the
unemployment rate is


$$B_S = \pi_{old} - \pi_{new}. \qquad (2)$$


For each questionnaire, the estimate of $P$ is $p$, the
weighted estimate from the CPS. The estimate of $\pi$ is
obtained by correcting $p$ for classification error bias using
the response probabilities from the MLCA. Let
$\mathbf{p}' = (p_1, p_2, p_3)$, where $p_1, p_2, p_3$ denote the estimates of the
proportions in EMP, UEM, and NLF, respectively. Let $\alpha_{ij}$
be the probability that an observation that truly belongs to
the $i$th category is assigned to the $j$th category, and let $\pi_i$
denote the true proportion in the population in the $i$th
category. Then

$$E(\mathbf{p}) = \mathbf{Q}'\boldsymbol{\pi} \qquad (3)$$

where $\boldsymbol{\pi}' = (\pi_1, \pi_2, \pi_3)$ and $\mathbf{Q} = [\alpha_{ij}]$ is the $3 \times 3$ matrix
with elements $\alpha_{ij}$. It follows that an estimator of $\boldsymbol{\pi}$ is

$$\hat{\boldsymbol{\pi}} = (\hat{\mathbf{Q}}')^{-1}\mathbf{p} \qquad (4)$$

where $\hat{\mathbf{Q}}$ is an MLCA estimate of $\mathbf{Q}$. For each questionnaire,
$\hat{\mathbf{Q}}$ was estimated by the average of the 10 MLCA
estimates (January-March through October-December)
using the 1993 CPS for the original questionnaire and the 1993
Parallel Survey for the revised questionnaire.
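
The correction in (4) amounts to solving a 3 x 3 linear system. The Python sketch below shows the mechanics under stated assumptions: the classification matrix and the true proportions are made-up values chosen for illustration, not the MLCA estimates, and the helper names are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Bring the largest remaining entry of this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def correct_for_misclassification(q_hat, p):
    """Return the solution of Q' pi = p, i.e. pi_hat = (Q')^{-1} p as in (4)."""
    q_transposed = [list(col) for col in zip(*q_hat)]
    return solve_linear(q_transposed, p)


# Hypothetical classification probabilities (rows: true EMP, UEM, NLF).
q_hat = [[0.98, 0.01, 0.01],
         [0.10, 0.80, 0.10],
         [0.02, 0.02, 0.96]]
# Hypothetical true proportions and the observed proportions they imply via (3).
pi_true = [0.60, 0.08, 0.32]
p_obs = [sum(pi_true[i] * q_hat[i][j] for i in range(3)) for j in range(3)]

pi_hat = correct_for_misclassification(q_hat, p_obs)
print([round(x, 4) for x in pi_hat])
```

In this constructed example the observed UEM proportion is below the true one, and applying (4) recovers the true proportions because the same matrix generated the observed values.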

Table 2 shows the results of this analysis. For UEM, $p$ =
6.38 for the original and 6.98 for the revised questionnaire.
If the unemployment rates are corrected for measurement
bias using (4), the unemployment rate increases to 7.09 percent
for the original questionnaire and 8.03 percent for the
revised questionnaire. Thus, an estimate of the measurement
bias for the original survey is 6.38 − 7.09 = −0.71 and for
the revised survey is 6.98 − 8.03 = −1.05. Note that the
measurement biases are negative for both the original and
revised questionnaires, indicating that UEM is
underestimated by both questionnaire versions.

For the revised questionnaire, the specification bias is
assumed to be 0. For the original questionnaire, it is
estimated by the difference 7.09 − 8.03 = −0.94 percent. An
estimate of the net bias, $B_T = B_S + B_M$, is −0.71 +
(−0.94) = −1.65 percent for the old series compared with
−1.05 + 0 = −1.05 percent for the new series. Thus, while it
is subject to greater measurement error bias, the new series
has smaller estimated net bias assuming $B_S = 0$ for the revised questionnaire.
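
The arithmetic of this decomposition can be retraced in a few lines of Python. The function and argument names are hypothetical; the four input values are the estimates quoted in the text, and the revised-questionnaire corrected rate (8.03) plays the role of the assumed gold standard.

```python
def bias_decomposition(p, pi_hat, pi_gold):
    """Split the net bias of p into measurement and specification parts.

    p       : published estimate (percent)
    pi_hat  : estimate corrected for classification error
    pi_gold : corrected estimate under the gold-standard questionnaire
    """
    measurement = p - pi_hat          # measurement error bias
    specification = pi_hat - pi_gold  # specification error bias
    return measurement, specification, measurement + specification


old_series = bias_decomposition(6.38, 7.09, 8.03)
new_series = bias_decomposition(6.98, 8.03, 8.03)
print([round(b, 2) for b in old_series])
print([round(b, 2) for b in new_series])
```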

Several limitations of these results should be mentioned.
First, as noted in the main paper, the estimates for the revised
questionnaire from the Parallel Survey may not be
representative of the revised CPS series. Second, the
analysis assumes that the revised questionnaire is the gold
standard for estimating the specification error bias in the
original questionnaire. This assumption could also be
challenged. Finally, no standard errors were provided for the
estimates in Table 2 and the hypothesis of smaller overall
bias in the revised questionnaire was not formally tested. Despite
these limitations, the results suggest the possibility that the
new unemployment series could have substantially lower
net bias than the old series.


Table 2
Comparison of Original and Revised Questionnaire Biases for the
CPS Unemployment Rate Based Upon Estimates from the 1993
CPS and the Parallel Survey

                  $p$     $\hat{\pi}$   $B_M$    $B_S$    $B_T$
1993 CPS          6.38    7.09          −0.71    −0.94    −1.65
Parallel Survey   6.98    8.03          −1.05    0 ¹      −1.05

¹ Note: Specification error bias is assumed to be 0 for the revised questions.
A New Algorithm for the Construction of Stratum Boundaries in Skewed
Populations 


PATRICIA GUNNING and JANE M. HORGAN ' 


ABSTRACT 


A simple and practicable algorithm for constructing stratum boundaries in such a way that the coefficients of variation are 
equal in each stratum is derived for positively skewed populations. The new algorithm is shown to compare favourably with 
the cumulative root frequency method (Dalenius and Hodges 1957) and the Lavallée and Hidiroglou (1988) approximation
method for estimating the optimum stratum boundaries.


KEY WORDS: Efficiency; Geometric progression; Neyman allocation; Stratification. 


1. INTRODUCTION 


A stratified random sampling design is a sampling plan 
in which a population is divided into mutually exclusive 
strata, and simple random samples are drawn from each 
stratum independently. The essential objective of strati- 
fication is to construct strata to allow for efficient esti- 
mation. In what follows X represents the known strati- 
fication or auxiliary variable while Y represents the 
unknown study variable. Suppose there are $L$ strata, containing
$N_h$ elements, from which a sample of size $n_h$ is to
be chosen independently from each stratum $(1 \le h \le L)$. We
write $N = \sum_{h=1}^{L} N_h$ and $n = \sum_{h=1}^{L} n_h$. In the case of the
stratified mean estimate,

$$\bar{y}_{st} = \sum_{h=1}^{L} \frac{N_h}{N}\,\bar{y}_h, \qquad (1)$$

where $\bar{y}_h$ is the mean of the sample elements in the $h$th
stratum, we need to choose the breaks in order to minimise
its variance

$$V(\bar{y}_{st}) = \sum_{h=1}^{L} \left(\frac{N_h}{N}\right)^2 \left(1 - \frac{n_h}{N_h}\right) \frac{S_{yh}^2}{n_h}, \qquad (2)$$

where

$$S_{yh}^2 = \sum_{i=1}^{N_h} (Y_{hi} - \bar{Y}_h)^2 / N_h$$

and $\bar{Y}_h$ is the mean of $Y$ in stratum $h$.


Dalenius (1950) derived equations for determining 
boundaries when stratifying variables by size, so that (2) is 
minimised, but these equations proved troublesome to solve 
because of dependencies among the components. Since then 
there have been numerous attempts to obtain efficient 
approximations to this optimum solution. The first such 
approximation, suggested by Dalenius and Hodges 
(1957, 1959), constructs the strata by taking equal intervals 
on the cumulative function of the square root of the 
frequencies; this method is still often used today. Ekman’s
rule (1959) of iteratively equalising the product of stratum
weights and stratum ranges was found to require arduous
calculations, and is less used than the method of Dalenius
and Hodges (Nicolini 2001). Lavallée and
Hidiroglou (1988) derived an iterative procedure for 
stratifying skewed populations into a take-all stratum and a 
number of take-some strata such that the sample size is 
minimised for a given level of reliability. Other recent 
contributions include Hedlin (2000) who revisited Ekman’s 
rule, Dorfman and Valliant (2000) who compared model- 
based stratified sampling with balanced sampling, and 
Rivest (2002) who constructed a generalisation of the 
Lavallée and Hidiroglou algorithm by providing models 
accounting for the discrepancy between the stratification 
variable and the survey variable. 

In the present paper we propose an algorithm which is 
much simpler to implement than any of those currently 
available. It is based on an observation by Cochran (1961), 
that with near optimum boundaries the coefficients of 
variation are often found to be approximately the same in all 
strata. He concluded however that computing and setting 
equal the standard deviations of the strata would be too 
complicated to be feasible in practice. In what follows we 
show that, for skewed distributions, the coefficients of 
variation can be approximately equalised between strata 


Patricia Gunning, School of Computing, Dublin City University, Dublin 9, Ireland; Jane M. Horgan, School of Computing, Dublin City University, 


Dublin 9, Ireland. 




using the geometric progression. This new algorithm is 
derived in section 2. Section 3 compares the efficiency of 
the new approximation with the cumulative root frequency 
and the Lavallée and Hidiroglou approximations. We 
summarise our findings in section 4. 


2. AN ALTERNATIVE METHOD OF STRATUM 
CONSTRUCTION 


To stratify a population by size is to subdivide it into 
intervals, with endpoints $k_0 < k_1 < \dots < k_L$. Ideally, the
division should be based on the survey variable Y. Such a 
construction is of course not possible since Y is unknown; if 
it were known we would not need to estimate it. In practice 
therefore we use a known auxiliary variable X, which is 
correlated with the survey variable. 

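In order to make the breaks $(k_0, k_1, \dots, k_L)$ for any
given $k_0$ and $k_L$, we seek to make $CV_h = S_{xh}/\bar{X}_h$ the
same for $h = 1, 2, \dots, L$:


$$\frac{S_{x1}}{\bar{X}_1} = \frac{S_{x2}}{\bar{X}_2} = \dots = \frac{S_{xL}}{\bar{X}_L}. \qquad (3)$$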


Now $S_{xh}$ is the standard deviation and $\bar{X}_h$ the mean of $X$ in
stratum $h$. If we make the assumption that the distribution
within each stratum is approximately uniform,
we may write


$$\bar{X}_h \approx \frac{k_h + k_{h-1}}{2}, \qquad (4)$$


$$S_{xh}^2 \approx \frac{(k_h - k_{h-1})^2}{12}. \qquad (5)$$


As an approximation to the coefficients of variation, this
gives


$$CV_h = \frac{(k_h - k_{h-1})/\sqrt{12}}{(k_h + k_{h-1})/2}; \qquad (6)$$

with equal $CV_h$ therefore we must have

$$\frac{k_{h+1} - k_h}{k_{h+1} + k_h} = \frac{k_h - k_{h-1}}{k_h + k_{h-1}}. \qquad (7)$$


This new and exotic recurrence relation reduces however to
something familiar:


$$k_h^2 = k_{h+1}\,k_{h-1}; \qquad (8)$$


the stratum boundaries are the terms of a geometric 
progression. 
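
The reduction of the equal-CV recurrence to the geometric mean condition is stated without intermediate steps. One way to fill them in, using only the boundary symbols $k_{h-1}$, $k_h$, $k_{h+1}$ and the constant ratio $r$, is to cross-multiply and cancel; the following is a sketch of that algebra rather than the authors' own working.

```latex
\begin{align*}
\frac{k_{h+1}-k_h}{k_{h+1}+k_h} &= \frac{k_h-k_{h-1}}{k_h+k_{h-1}}
  && \text{(equal $CV_h$ in adjacent strata)} \\
(k_{h+1}-k_h)(k_h+k_{h-1}) &= (k_h-k_{h-1})(k_{h+1}+k_h)
  && \text{(cross-multiply)} \\
k_{h+1}k_h + k_{h+1}k_{h-1} - k_h^2 - k_h k_{h-1}
  &= k_h k_{h+1} + k_h^2 - k_{h-1}k_{h+1} - k_{h-1}k_h
  && \text{(expand)} \\
2\,k_{h+1}k_{h-1} &= 2\,k_h^2
  && \text{(cancel common terms)} \\
\frac{k_{h+1}}{k_h} &= \frac{k_h}{k_{h-1}} = r
  && \text{(constant ratio)}
\end{align*}
```

Each boundary is therefore the geometric mean of its two neighbours, which is the defining property of a geometric progression.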


$$k_h = a r^h \qquad (h = 0, 1, \dots, L). \qquad (9)$$


Thus $a = k_0$, the minimum value of the variable, and
$a r^L = k_L$, the maximum value of the variable. It follows
that the constant ratio can be calculated as $r = (k_L/k_0)^{1/L}$.
For a numerical example take


$$L = 4; \quad k_0 = 5; \quad k_L = 50{,}000: \qquad (10)$$


thus $k_h = 5 \cdot 10^h$ $(h = 0, 1, 2, 3, 4)$ and the strata form the
ranges


$$5 - 50; \quad 50 - 500; \quad 500 - 5{,}000; \quad 5{,}000 - 50{,}000. \qquad (11)$$


This is clearly an extremely simple method of obtaining 
stratum breaks. 
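
The calculation is short enough to sketch in Python. The function name is hypothetical; the rule it encodes is the geometric progression between the minimum and maximum of the stratification variable, and the guard against a non-positive lower end point is an added assumption reflecting the fact that the ratio is undefined when the minimum is zero.

```python
def geometric_boundaries(k0, kL, L):
    """Stratum boundaries k_0, ..., k_L forming a geometric progression.

    k0 : minimum value of the stratification variable (must be > 0)
    kL : maximum value of the stratification variable
    L  : number of strata
    """
    if k0 <= 0 or kL <= k0 or L < 1:
        raise ValueError("need 0 < k0 < kL and at least one stratum")
    ratio = (kL / k0) ** (1.0 / L)          # constant ratio r
    return [k0 * ratio ** h for h in range(L + 1)]


# The numerical example from the text: L = 4, k0 = 5, kL = 50,000.
breaks = geometric_boundaries(5, 50_000, 4)
print([round(b) for b in breaks])
```

Rounding is applied only for display, since the ratio is computed in floating point.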

The relationship in (8) depends on the assumption that 
the distributions within strata are uniform. This may be 
justified by the following heuristic argument. When the 
parent distribution is positively skewed, then the low values 
of the variable have a high incidence, which decreases as the 
variable values increase, which makes it appropriate to take 
small intervals at the beginning and large intervals at the 
end. This is what happens with a geometric series of 
constant ratio greater than one. In the lower range of the 
variable, the strata are narrow so that an assumption of 
rectangular distribution in them is not unreasonable. As the 
value of the variable increases, the stratum width increases 
geometrically. This coincides with the decreased rate of 
change of the incidence of the positively skewed variable, 
so here also the assumption of uniformity is reasonable. 

This algorithm will of course not work for normal 
distributions. Also since the boundaries increase geo- 
metrically, it will not work well with variables that have 
very low starting points: this will lead to too many small 
strata; the rule breaks down completely when the lower end 
point is zero. We expect the best results when the 
distribution is highly positively skewed and the upper part 
contains a small percentage of the total frequency. 


3. THE PERFORMANCE OF THE ALGORITHM 


3.1 Some Real Positively Skewed Populations 


To test our algorithm, we implement it on four specific 
populations, which are skewed with positive tail: 

Our first population (Population 1) is an accounting 
population of debtors in an Irish firm, detailed in Horgan 
(2003). In addition, we use three of the skewed populations 
that Cochran (1961) invoked to illustrate the efficiency of 
the cumulative root frequency method of stratum
construction. These are: 


— The population in thousands of US cities 
(Population 2); 

— The number of students in four-year US colleges 
(Population 3); 

— The resources in millions of dollars of a large 
commercial bank in the US (Population 4). 


There were five other populations in the Cochran paper, 
which turned out to be unsuitable for use with our 
algorithm. In three cases the variable was a proportion: 
agricultural loans, real estate loans and independent loans
expressed as a percentage of the total amount of bank loans. 
Another, a population of farms in which the variable ranged 
from 1 to 18, was essentially discrete. Yet another, a 
population of income tax returns, was not sufficiently 
skewed: it owed its skewness to the top 0.05% of the 
population, and when this was removed, or put in a take-all 
stratum, the skewness disappeared. 

These four populations are illustrated and summarised in
Figure 1 and Table 1 in decreasing order of skewness.

Figure 1. Populations

The new algorithm is implemented on these populations,
and compared with the cumulative root frequency
(cum $\sqrt{f}$) and the Lavallée-Hidiroglou methods of stratum
construction.


3.2 Comparison with the Cumulative Root 
Frequency Method 


We first compare the performance of the new algorithm
with cum $\sqrt{f}$ by dividing the populations summarised in
Table 1 into L = 3, 4 and 5 strata, using both methods to
make the breaks. The results are given in Tables 2, 3 and 4.

A cursory examination of the coefficients of variation in
Tables 2, 3 and 4 suggests that, in most cases, the geometric
method is more successful than cum $\sqrt{f}$ in obtaining
near-equal strata $CV_h$. For example in Population 1, which has
the greatest skewness, the $CV_h$ differ substantially from
each other when cum $\sqrt{f}$ is used to make the breaks, while
the geometric method appears to achieve near-equal $CV_h$ in
all cases of 3, 4 and 5 strata: the best results are obtained
with L = 5. In the other three populations, the $CV_h$ are not
as diverse with cum $\sqrt{f}$, but they still appear more
variable than those obtained with the geometric method of
stratum construction.

The $CV_h$ with the geometric method are more
homogeneous when L = 4 or 5 than when L = 3; this is to be
expected since the validity of the assumption of uniformity
of the distribution of elements within stratum is strengthened
with increased number of strata.

A more detailed analysis of the variability of the $CV_h$
between strata is given in Table 5, where the standard
deviation of the $CV_h$ is calculated for each design.
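
The quantities compared here can be computed directly from a list of population values and a set of boundaries. The Python sketch below is illustrative only: the function names and the tiny data set are hypothetical, the within-stratum standard deviation uses the population formula, and the spread of the stratum coefficients of variation is taken as their sample standard deviation, which is an assumption about how the summary was calculated.

```python
import statistics


def stratum_cvs(values, boundaries):
    """Coefficient of variation of X within each stratum.

    boundaries = [k_0, k_1, ..., k_L]; stratum h holds values with
    k_{h-1} <= x < k_h, and the last stratum also includes k_L.
    Raises statistics.StatisticsError if a stratum is empty.
    """
    n_strata = len(boundaries) - 1
    cvs = []
    for h in range(1, n_strata + 1):
        low, high = boundaries[h - 1], boundaries[h]
        if h == n_strata:
            members = [x for x in values if low <= x <= high]
        else:
            members = [x for x in values if low <= x < high]
        cvs.append(statistics.pstdev(members) / statistics.mean(members))
    return cvs


def cv_spread(cvs):
    """Standard deviation of the stratum CVs (smaller means more equal)."""
    return statistics.stdev(cvs)


# Hypothetical skewed values split at geometric boundaries 1, 10, 100.
cvs = stratum_cvs([1, 3, 10, 30], [1, 10, 100])
print(cvs, cv_spread(cvs))
```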


Table 1
Summary Statistics for Real Populations

Population   N       Range          Skewness   Mean       Variance
1            3,369   40 - 28,000    6.44       838.64     8S 27)
2            1,038   10 - 200       2.88       32,5)      924
3            677     200 - 10,000   2.46       1,563.00   3,236,602
4            357     70 - 1,000     2.08       225.62     36,274
Table 2 
The Geometric vs the Cum $\sqrt{f}$: Stratum Breaks with L = 3 and n = 100
Stratification Stratum 
Population Method CV 1 o) 3 
1 Geometric 0.0600 kp 354 3,152 
Nj 2,334 1,288 189 
np 9 46 45 
CV), 0.71 0.68 0.64 
Cum Jf 0.0600 ky, 558 2,236 
Np BB 38) 15D 295 
Np 19 17 64 
CV), 0.70 0.42 0.76 
2 Geometric 0.0270 ky 26 Te 
Np 701 243 94 
Np 36 29 35 
GVcatw 028 0.23 0.33 
Cum Jf 0.0269 i 28 66 
Np 729 208 101 
ny, 40 22 38 
CV), 0.29 0.25 0.34 
3 Geometric 0.0317 Kp 726 2,645 
Np 25 321 103 
np, 9 38 53 
€V, 0032 0.37 0.39 
Gimgyy ~ 0.0282 & L179 3,629 
Np 456 152 69 
np 37 35 28 
CV), 0.41 0.31 0.27 
4 Geometric 0.0184 Kp 168 405 
Np 211 93 53 
ny 27 27 46 
CV), 0.23 0.24 0.30 
Cum ff 0.0198 sk, 162 441 
Mp 207 107 43 
ny 25 39 36 
cy © es 0.30 0.27 
Table 3
The Geometric vs the Cum $\sqrt{f}$: Stratum Breaks with L = 4 and n = 100
Stratification Stratum 
Population Method CV 1 2» 3 4 
1 Geometric 0.0430 kp 205 1,057 5,443 
Np 1,416 1,382 483 88 
Np 6 22 40 32 
CV), 0.45 0.44 0.48 0.50 
Cum a 0.0480 Kp 558 [SLIDE 2,795 
Ny, 23339 483 325 Pape 
ny, 23 5 10 62 
CV), 0.70 0.19 0.27 0.69 
2 Geometric 0.0194 kp 20 43 93 200 
Np 459 398 130 Sill 
np 22 31 25 22 
CV, 0.22 0.20 0.22 0.22 
Cuml/foenooniaeees k, 19 38 85 
Np B98 428 155 62 
ny 15 26 30 29 
CV, 0.20 0.17 0.25 0.26 
3 Geometric 0.0214 kp 526 1,386 3,653 
Np 138 343 127 69 
np 5 27 26 42 
CV. gage 0.26 0.26 0.27 
Cum ie 0.0230 Kp 690 2,160 5,100 
Np, 23) BIg 75 48 
Np 1 43 21 28 
CV;, 0.31 0.33 0.29 0.19 
4 Geometric 0.0142 kp 134 261 504 
Np, 156 109 63 29 
Ap 20 23 wy 28 
CV, 0.18 0.19 0.19 0.20 
Cum ae 0.0143 Kp 162 BSS 488 
Np 207 58 57 35 
ny, 33 9 23 35 
CV), 0.23 0.11 0.18 0.24 
Table 4 
The Geometric vs the Cum $\sqrt{f}$: Stratum Breaks with L = 5 and n = 100
Stratification Stratum 
Population Method GV, 1 2 3 4 5 
1 Geometric 0.0360 kp 147 549 2,037 Tse? 
Np 1,054 267 732 265 51 
Np z 14 Ai 33 24 
CV), Oisy7/ 0.38 0.40 0.37 0.41 
Cum ff 00349 Wa, 1279 838 1,677 4,193 
Np 1,644 1,010 BSD 249 134 
Np S 14 7 15 25) 
CV), 0.52 0.30 0.20 0.25 0.57 
2 Geometric 0.0144 kp, 17 32 59 108 
Np, 364 418 130 87 39 
Np 18 28 ia 20 il7/ 
CV), 0.18 0.14 0.15 0.16 0.15 
Cum ff 0.0186 sk, 28 38 57 104 
N, 109 92 89 88 40 
np 58 4 7 16 15 
CV), 0.28 0.08 0.11 0.16 0.16 
3 Geometric 0.0184 kp 433 941 2,043 4,434 
Ny 100 2S 1,989 74 56 


Cum ff 0.0212 k, 1,179 1,669 3,139 6,079 


ny, 50 3 17 15 15 
CV, 0.40 0.09 0.20 0.19 0.13 
4 Geometric 0.0110 Eo tis 200 339 576 
Np 114 116 64 39 24 
ny, 12 20 24 18 24 
CV, 0.14 0.14 0.17 C12" “O16 
Cum Jf 0.0119 be 62 255 395 627 
N, 207 58 37 36 19 
Table 5
The Variability of the $CV_h$ for the Geometric and the Cum $\sqrt{f}$ Methods

                         Population
Strata   Method          1       2       3       4
3        Geometric       0.035   0.050   0.036   0.038
         Cum $\sqrt{f}$  0.181   0.045   0.072   0.035
4        Geometric       0.027   0.010   0.006   0.008
         Cum $\sqrt{f}$  0.276   0.042   0.062   0.059
5        Geometric       0.018   0.015   0.013   0.020
         Cum $\sqrt{f}$  0.166   0.076   0.119   0.054


We see from Table 5 that, with just two exceptions, the
standard deviations of the $CV_h$ are substantially lower with
the geometric method of stratum construction than with
cum $\sqrt{f}$. In the two cases where the cumulative root has a lower
standard deviation than the geometric, the difference
between them is not great, and both occur with the smallest
number of strata, L = 3, in Populations 2 and 4. We may
conclude therefore that the new algorithm is successful in
breaking the strata in such a way that the $CV_h$ are near
equal.

What remains is to investigate whether the geometric
breaks lead to more efficient estimation than cum $\sqrt{f}$. To
do this, the two methods are compared in terms of the
relative efficiency or variance ratio obtained with n = 100
allocated optimally among the strata using Neyman
allocation (Neyman 1934):


$$n_h = \frac{N_h S_{xh}}{\sum_{i=1}^{L} N_i S_{xi}}\; n. \qquad (12)$$

The relative efficiency is defined as

$$\mathrm{eff} = \frac{V_{cum}(\bar{x}_{st})}{V_{geom}(\bar{x}_{st})}, \qquad (13)$$

where $V_{cum}(\bar{x}_{st})$ and $V_{geom}(\bar{x}_{st})$ are the variances of the
mean respectively with the cumulative root frequency and
the geometric methods, with n = 100 and $n_h$ allocated as
in (12) for each of the stratification methods. In sample size
planning the relative efficiencies may be interpreted as the
proportionate increase or decrease in the sample size with
cum $\sqrt{f}$ to obtain the same precision as that of the
geometric method with n = 100.

The variance calculations are based on the auxiliary
variable X, and since this is assumed to be highly correlated
with the unknown survey variable Y, we can assume the
relative efficiency eff, given in (13), will be a reasonable
approximation of the relative efficiency of Y.

Table 6 gives the variance ratio when the number of
strata L = 3, 4 and 5.

From Table 6 we see that, while this new method is not
always more efficient than the cumulative root frequency
method of stratum construction, when it is, it is substantially
so, and when it is not it is only marginally worse. For
example, large gains in efficiency are observed when L = 5
in Populations 2, 3 and 4: here the relative efficiencies are
1.69, 1.33 and 1.17 respectively, indicating that samples of
sizes n = 169, 133 and 117 are required with cum $\sqrt{f}$ to
obtain the same precision as that of the geometric method
with n = 100.
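
The comparison behind the variance ratios can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the allocation is left unrounded, no stratum is allowed to need more units than it contains, and the stratum sizes and standard deviations in the example are made-up values rather than figures from the four populations.

```python
def neyman_allocation(stratum_sizes, stratum_sds, n):
    """Allocate n in proportion to N_h * S_h (unrounded)."""
    weights = [size * sd for size, sd in zip(stratum_sizes, stratum_sds)]
    total = sum(weights)
    return [n * w / total for w in weights]


def stratified_mean_variance(stratum_sizes, stratum_sds, allocation):
    """Variance of the stratified mean with finite population correction."""
    population = sum(stratum_sizes)
    return sum(
        (size / population) ** 2 * (1 - n_h / size) * sd ** 2 / n_h
        for size, sd, n_h in zip(stratum_sizes, stratum_sds, allocation)
    )


def relative_efficiency(design_cum, design_geom, n):
    """Variance under the cumulative root design over the geometric design.

    Each design is a pair (stratum_sizes, stratum_sds).
    """
    variances = []
    for sizes, sds in (design_cum, design_geom):
        allocation = neyman_allocation(sizes, sds, n)
        variances.append(stratified_mean_variance(sizes, sds, allocation))
    return variances[0] / variances[1]


# Hypothetical two-stratum design.
sizes, sds = [100, 100], [1.0, 3.0]
allocation = neyman_allocation(sizes, sds, 40)
print(allocation, stratified_mean_variance(sizes, sds, allocation))
```

A ratio above 1 from `relative_efficiency` would favour the geometric breaks, matching the reading of the variance ratios given in the text.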


Table 6 
Efficiencies of the Cum $\sqrt{f}$ Relative
to the Geometric Method 


Population 
1 2 3 + 
3 OST, O99 50 72a LAG 
4 128 enil, LOT" AES 
et 0.94 1.69 1:33 


Strata 


17 


We also see from Table 6 that while there are four cases 
where the relative efficiency is less than 1, with one 
exception, all are greater than 0.9. The exception is 
Population 3 with L = 3, the smallest number of strata; the 
relative efficiency in this case is 0.79. 


3.3 Comparison with the Lavallée and Hidiroglou 
Algorithm 


With the Lavallée-Hidiroglou algorithm, the optimum
boundaries $k_1, k_2, \dots, k_{L-1}$ are chosen to minimise the
sample size n for a given level of precision. The requirement
on precision is usually stated by requiring the coefficient of
variation to be equal to some specified level between 1% and
10%. Obtaining the minimum n is an iterative process, and
the SAS code used for implementing it was obtained from
the web at http://www.ulval.ca/pages/|pr/.

To compare the performance of the new method with 
Lavallée-Hidiroglou, the CVs from the geometric algorithm 
given in Tables 2, 3 and 4 are used as input for the Lavallée- 
Hidiroglou algorithm, and the sample sizes required to 
obtain the same precision as that of the geometric method 
with n = 100 are computed. The results are given in Table 7. 

The first thing to notice from Table 7 is that the sample
size required with the Lavallée-Hidiroglou algorithm to
obtain the same precision as the geometric method is greater
than 100 in all but four cases. In Population 2 with 5 strata,
it is necessary to increase the sample size by 36% to
n = 136, to obtain the same precision as the geometric
method with n = 100. With three and four strata, sample
sizes of n = 121 and 113 are required in Population 1, and
sample sizes of n = 123 and n = 117 are required in
Population 2, to obtain the same precision as the geometric
method. When the sample size falls below n = 100, the
drop is not as large. In Population 4, with four and five
strata, n = 93 and n = 99 respectively, and in Population 1
with 5 strata a sample size of n = 90 will suffice with the
Lavallée-Hidiroglou algorithm to obtain the same precision
as the geometric method.

The results in Table 7 might appear to indicate that the 
geometric method outperforms the Lavallée-Hidiroglou 
method in terms of the minimum sample size required for a
specified precision. We observe however that the geometric
method does not give a take-all stratum. If this is required it
is more appropriate to use the Lavallée-Hidiroglou method to obtain
the strata. Often, in financial applications the top stratum is
decided judgementally; for example US state taxing 
authorities typically decide their take-all stratum based on a 
total percentage of purchase amounts (Falk, Rotz and 
Young 2003). If after such a take-all stratum has been 
removed the skewness remains, the geometric method is 
probably the easier and more efficient way of obtaining the 
remaining strata. 


Table 7 
Boundaries and Sample Size Required with the Lavallée-Hidiroglou Method to Obtain the Same 
CV as the Geometric Method when n= 100 


Population n GV; 
121 
2 123 0.0270 Kp 
3 107 


0.0317 ki, 


4 100 0.0184 kp 


0.0430 ky 
0.0194 sky, 
3 103 


0.0214 ki, 


0.0142 ky 


0.0360 ky, 
d 136 0.0144 sk, 
3 105 


0.0184 =k, 


0.0119 kp 


3 Strata 
2 3 
1,248 8,676 
2,867 464 38 
42 41 38 
0.87 0.57 0.37 
35 102 
795 202 4] 
47 35 4] 
0.31 0.31 0.17 
1,398 4,197 
481 135 61 
28 18 61 
0.41 0.30 0.24 
172 361 
2ND 85 60 
22, 18 60 
0.23 0.21 0.32 
4 Strata 
1 2 4 
442 1,828 8,411 
2,086 915 327 4] 
16 21 35 41 
0.64 0.41 0.45 38 
19 37 95 
393 420 176 49 
13 21 34 49 
0.19 0.16 0.28 0.21 
740 1,505 3,819 
256 234 118 69 
9 10 15 69 
0.32 0.18 0.25 0.27 
7 188 359 
tal 112 74 60 
7 9 17 60 
0.14 0.12 0.19 0.32 
5 Strata 
1 D 3 4 5 
342 1,153 BAS 10,301 
1,846 993 357 147 26 
12 14 17 21 26 
0.58 0.34 0.31 0.31 0.32 
14 21 35 80 
189 270 336 164 79 
4 7 16 30 719 
0.12 0.10 0.12 0.24 0.30 
S12) 869 LEST 3,675 
133 180 185 110 69 
4 5) 10 17 69 
0.27 0.15 0.16 0.23 O27; 
99 130 189 339 
70 68 85 71 63 
4 4 8 20 63 
0.10 0.08 0.10 0.18 0.33 
4. SUMMARY


This paper derives a simple algorithm for the 
construction of stratum boundaries in positively skewed 
populations, for which it is shown that the stratum breaks 
may be obtained using a geometric progression. The
proposed method is easier to implement than approximations
previously proposed. Comparisons with the commonly
used cumulative root frequency method, using four
positively skewed real populations divided into three, four
and five strata, showed substantial gains in the precision of
the estimator of the mean, the greatest gains occurring when
the number of strata was five. Comparisons with the
Lavallée-Hidiroglou method indicated that a greater sample
size was required to obtain the same precision as the
geometric method in most cases; the greatest increase in the
required sample size occurred with the largest number of 
strata. One limitation of the new algorithm compared to the 
Lavallée-Hidiroglou method of stratum construction is that 
it does not determine a take-all top stratum. 
Feeding Back Information on Ineligibility from Sample Surveys
to the Frame 


DAN HEDLIN and SUOJIN WANG ' 


ABSTRACT 


It is usually discovered in the data collection phase of a survey that some units in the sample are ineligible even if the frame 
information has indicated otherwise. For example, in many business surveys a nonnegligible proportion of the sampled units 
will have ceased trading since the latest update of the frame. This information may be fed back to the frame and used in 
subsequent surveys, thereby making forthcoming samples more efficient by avoiding sampling ineligible units. On the first 
of two survey occasions, we assume that all ineligible units in the sample (or set of samples) are detected and excluded from 
the frame. On the second occasion, a subsample of the eligible part is observed again. The subsample may be augmented 
with a fresh sample that will contain both eligible and ineligible units. We investigate what effect on survey estimation the 
process of feeding back information on ineligibility may have, and derive an expression for the bias that can occur as a 
result of feeding back. The focus is on estimation of the total using the common expansion estimator. An estimator that is 
nearly unbiased in the presence of feed back is obtained. This estimator relies on consistent estimates of the number of 
eligible and ineligible units in the population being available. 


KEY WORDS: Dead unit; Feed back bias; Overcoverage; Permanent random number sampling; Panel survey; 
Coordinated samples.


1. INTRODUCTION 


To facilitate estimation of change, consecutive samples 
in a repeated survey are usually overlapping. If several 
surveys draw samples from the same frame, it is often 
desirable to spread the response burden out by making sure 
that samples for different surveys are not overlapping to a 
greater extent than necessary. This is particularly desirable if 
the frame is moderately large and used for many continuing 
surveys, which is a situation that many national statistical 
institutes face when conducting business surveys. Stratified 
simple random sampling is a very common design for busi- 
ness surveys. The skewed distribution of businesses calls for 
large sampling fractions in many strata, which aggravates 
the response burden for medium size and large businesses. 
Both estimation of change and response burden issues are of 
paramount importance in official business statistics. There- 
fore, sampling systems have been constructed that allow the 
organisation to co-ordinate samples, either positively or 
negatively (i.e. to create overlap or to make sure that there is 
little overlap). 

For example, the Office for National Statistics (ONS) in 
the United Kingdom uses the Permanent Random Number 
(PRN) technique, which is a widely used method for 
drawing samples from lists. A PRN from the uniform distri- 
bution on [0,1] is attached to each frame unit independently 
of each other and independently of the unit labels and any 
variables associated with the units. Each unit will retain the 
PRN throughout its existence. The units can be ordered 
along a line starting at 0 and ending at 1 and we refer to this 
line as the PRN line. To draw a simple random sample 
without replacement, an SI, with a predetermined sample 
size n, a point is selected (randomly or purposively) on the 
PRN line and the n units to the right (say) are included in the 
sample. Two SIs are fully co-ordinated if they are drawn 
from the same interval. For overviews and further details 
see Ohlsson (1995) and Ernst, Valliant and Casady (2000). 
Samples for repeated surveys can also be selected with a 
panel technique where a set of rotation groups are selected 
at the first wave and one, say, of the groups is replaced with 
a fresh rotation group at the second wave and the other 
groups are retained in the sample. The difference between 
PRN sampling and panel sampling is more about the way to 
control overlaps than having different sampling designs. 
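
The PRN selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the system used by any statistical office: the function names `assign_prns` and `prn_sample` are hypothetical, and wrapping around from 1 back to 0 when the end of the PRN line is reached is an assumption made here for completeness.

```python
import random

def assign_prns(unit_ids, seed=None):
    """Attach an independent Uniform(0, 1) permanent random number to each unit."""
    rng = random.Random(seed)
    return {unit: rng.random() for unit in unit_ids}

def prn_sample(prns, start, n):
    """Return the n units closest to the right of `start` on the PRN line."""
    if n > len(prns):
        raise ValueError("sample size exceeds the number of frame units")

    def distance(unit):
        # Distance to the right of `start`, measured around the PRN line.
        return (prns[unit] - start) % 1.0

    return sorted(prns, key=distance)[:n]

frame = assign_prns(range(1000), seed=1)
survey_a = prn_sample(frame, start=0.0, n=100)
survey_b = prn_sample(frame, start=0.0, n=100)   # same interval as survey_a
survey_c = prn_sample(frame, start=0.5, n=100)   # a distant interval
print(len(set(survey_a) & set(survey_b)), len(set(survey_a) & set(survey_c)))
```

Two samples drawn from the same starting point coincide (positive co-ordination), whereas starting points that are far apart give samples with little or no overlap (negative co-ordination).
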
There are in principle two main sources of data that are 
used to maintain a frame: administrative ones and surveys. 
Various administrative bodies send tapes to the ONS on a 
regular basis with information on, e.g., births and deaths of 
businesses. While these tapes are sent to the ONS very 
frequently, the distribution of the time it takes for a new unit 
or an alteration of an old unit to be registered on the frame is 
highly skewed. This is partly due to frame maintenance 
procedures, e.g. to avoid duplicates. There is also very often 
a considerable difference in time between the actual and 
formal termination of a business. Therefore, most of the 
ONS’s business surveys share the information on deaths 
they obtain through their samples with other business 
surveys to speed up the information process. We examine 
the effects of using sample surveys to update a frame that is 
used for repeated surveys. This is in principle how infor- 
mation on dead units is treated in business surveys at the 
ONS, Statistics Sweden, and some other national statistical 
institutes. 

It would seem natural that this new information should 
be made available to other sample surveys, which otherwise 
may include the dead units in their samples and therefore 
lose precision. However, as pointed out by Srinath (1987) 
among others, such a procedure may cause bias. We refer to 
this as feed back bias, which results whenever the sampling 
mechanism is not independent of the feed back procedure. 
For example, consider a situation where all dead units are 
found and deleted at the first wave of a panel survey. If no 
further deaths have occurred up to the second-wave obser- 
vation of the panel units, the second-wave sample contains 
only live units. Without knowledge of the total number of 
live units in the population at the time of the second wave, 
an unbiased estimator of the total cannot be constructed. 
While more information about the population has been 
gathered when the deaths were recorded at the first wave, 
there is actually less information in the second-wave sample 
on the proportion of live units in the population. We show 
how an estimate of the number of live units in the popu- 
lation can be used to construct an approximately unbiased 
estimate of the population total. 

A safe recommendation would be that no information on 
deaths from sample surveys, other than from completely 
enumerated strata, may be used to update the frame when 
samples are co-ordinated over time (cf. Ohlsson 1995, page 
168, and Colledge 1989, page 103). However, to prohibit 
feeding back seems to deny oneself the use of all available 
information. We obtain an expression for the feed back bias 
and show that the feed back bias can be estimated and used 
to adjust conventional estimators. Schiopu-Kratina and 
Srinath (1991) adjust the sampling weights to counter an 
expected too low proportion of dead units in the rotating 
sample of the Survey of Employment, Payroll and Hours 
conducted by Statistics Canada. Hidiroglou and Laniel 
(2001) discuss the feed back issue briefly. A general discus- 
sion of frame issues is given by Colledge (1995) and over- 
views of issues associated with continuing business surveys 
include Colledge (1989), Hidiroglou and Srinath (1993), 
Srinath and Carpenter (1995), and Hidiroglou and Laniel 
(2001). 

Instead of the terms eligible and ineligible we use the 
more emotive words live and dead, although our reasoning 
does cover all kinds of ineligibility. The discussion is 
confined to the estimation of the total 


$$t_y = \sum_{U} y_k \qquad (1)$$


of some study variable $y' = (y_1, y_2, \ldots, y_N)$ on a popu- 
lation $U$ with unit labels $\{1, 2, \ldots, N\}$. 

When the sampled units are observed, we assume that all 
dead units in the sample are classified as dead and the frame 
is updated with this information. This may be difficult in 
practice. In some surveys, however, the eligibility of all 
nonresponding units can be correctly identified. 

Section 2 introduces the necessary notation and concepts 
and gives expressions for the feed back bias when esti- 
mating a total. Section 3 discusses three strategies that may 
be used in the presence of feed back and compares these in a 
simulation study. The paper concludes with a discussion in 
section 4. 


2. EXPRESSIONS FOR FEED BACK BIAS 


2.1 Introduction and Notation 


We assume throughout that a dead unit is always out of 
scope and that the value of the study variable of a dead unit 
is always zero. (It is conceivable that dead units are eligible 
in some surveys; for example, a business survey collecting 
data on production may have defined businesses that were 
alive at least part of the reference period as eligible.) We 
adopt the design-based view that the survey population and 
the study variable are fixed and non-stochastic at any given 
point in time. The situation we address is as follows. One or 
more samples are drawn from the frame which comprises 
the original survey population, $U_1$. Let the set of samples 
drawn from $U_1$ be denoted by $s_1$. For convenience we 
assume that the frame units and population units are of the 
same type. We refer to the updated frame, where all dead 
units that have been included in samples from $U_1$ have been 
excluded, as the current survey population, $U_2$. For 
example, two surveys may simultaneously work with a 
sample each, and after they have fed back, $U_1$ has shrunk to 
$U_2$. We disregard births of new units and deaths other than 
those deleted through samples from $U_1$. We will also 
disregard undercoverage, nonresponse and measurement 
errors. In practice, administrative sources will provide 
information on deaths. They work independently from the 
sampling procedures employed by the statistical agency and 
will therefore not contribute to feed back bias. We call such 
units dead by administrative sources and can think of 
them as being excluded from the population. See 
Hidiroglou and Laniel (2001) for a discussion of estimation 
in the presence of units dead by administrative sources. 
While the sampling design here is assumed to be SI, it can 
readily be extended to stratified simple random sampling. 

Let $U_{2,d}$ and $U_{2,l}$ be the two subsets of the current survey 
population, $U_2 = U_{2,d} \cup U_{2,l}$, that consist of dead and live 
units, respectively. All units in $U_{2,d}$ and $U_{2,l}$ are assumed to 
be flagged as live. Units that are flagged as dead but for 
which the independence of detection and the sampling 
mechanism cannot be assured are called dead by sample 
survey sources. In our set-up, these are the dead units 
detected in samples taken from $U_1$. Let the set of these units 
be denoted by $s_{1,d}$, and we have the relationship 
$U_1 = U_2 \cup s_{1,d}$. Figure 1 displays the sets and their relationships. 
Let N and n with a proper subscript be the size of the 
corresponding population and sample(s), respectively. Then 
$N_1 = N_2 + n_{1,d}$ and $N_2 = N_{2,l} + N_{2,d}$. At the time when samples 
are drawn from $U_2$, $N_2$ and $n_{1,d}$ are known numbers, whereas 
$N_{2,l}$ and $N_{2,d}$ are unknown. Moreover, $n_{1,d}$, $N_{2,d}$ and $N_2$ could 
be viewed as random depending on feed back results, while 
$N_{2,l}$ is fixed. Following principles of Durbin (1969) and 
more recently Thompson (1997), we would in many 
situations prefer to condition on $n_{1,d}$. For example, if it is 
seen that $n_{1,d} = 0$, then it does not seem appropriate to 
include in the inference the possibility that $n_{1,d}$ could have 
been large. However, to analyse the development of the feed 
back bias over a series of waves in a panel survey when 
planning the survey, unconditional analysis would be 
preferable. We also provide an expression for the 
unconditional feed back bias. 

Denote by $s_{1,l}$ the live part of $s_1$, i.e., the part of $U_2$ that 
was covered by the previous sample(s) drawn from $U_1$; see 
Figure 1. Clearly, $s_{1,l}$ is a random set and we have 
$s_{1,l} \subset U_{2,l}$. Let the nonsampled part of $U_2$ be denoted by 
$U_{2,wd}$ (‘wd’ for ‘with dead units’). It is also a random set and 
encompasses all of $U_{2,d}$ and part of $U_{2,l}$. We have 
$U_2 = U_{2,wd} \cup s_{1,l}$. 

Let $s_2$ be an SI taken from $U_2$. Estimators based on $s_2$ will 
suffer from feed back bias unless special information is at 
hand, such as knowledge about $N_{2,l}$, which is not usually the 
case. To derive an expression for the feed back bias we shall 
first obtain the inclusion probabilities. To do this, it is useful 
to consider the two sample parts of $s_2$ separately: the sample 
part $s_{2,a}$ of size $n_{2,a}$ taken from $s_{1,l}$ through PRN sampling or 
a panel sampling technique, and the remaining part $s_{2,b}$ 
taken from $U_{2,wd}$. If the sampling is done with a panel 
technique, the sample parts $s_{2,a}$ and $s_{2,b}$ are the old and new 
rotation groups, respectively. If the sample is drawn with 
PRN sampling, $s_{2,a}$ and $s_{2,b}$ consist of units with PRNs that 
fell in $s_1$ or did not fall in $s_1$, respectively. Whether the 
sample was drawn through PRN sampling or a panel 
sampling technique, the sample parts can be viewed as two 
fixed size samples, each drawn with the SI design from their 
respective subpopulation. We condition on $n_{2,a}$ and $n_{2,b}$ 
throughout without making it explicit in formulae. With the 
notation $(k \in s_{2,a})$ we refer to the event that a unit is first 
included in the first-wave sample(s) from $U_1$ and then in the 
second-wave sample taken from what remains of the first- 
wave sample(s) after dead units have been taken out. The 
notation $(k \in s_{2,b})$ is analogous. Let $I(k \in s_{2,a}) = 1$ when 
unit k is included in $s_{2,a}$, otherwise $I(k \in s_{2,a}) = 0$. To 
derive the overall bias it is convenient to analyse the biases 
from the sample parts $s_{2,a}$ and $s_{2,b}$. We derive an expression 
for each of these in section 2.2 and section 2.3, respectively, 
and in section 2.4 the bias expressions will be amalgamated. 


Figure 1. The original survey population, $U_1$, and its subsets. The grey area represents $s_1$, the sample from $U_1$. 


2.2 Feed Back Bias from a Sub-sample from the 
Original Sample 


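Suppose a sub-sample $s_{2,a}$ is taken from $s_{1,l}$, the live part 
of the first-wave sample(s). Recall that $y_k = 0$ if k is a dead 
unit and that $U_2 = U_{2,d} \cup U_{2,l}$. Thus we have 
$\sum_{s_{2,a}} y_k = \sum_{U_2} y_k I(k \in s_{2,a}) = \sum_{U_{2,l}} y_k I(k \in s_{2,a})$. 
Assume that $N_{2,l} > 0$. Then we obtain that 
$\Pr[k \in s_{2,a} \mid n_{1,d}] = n_{2,a}/N_{2,l}$ for $k \in U_{2,l}$, since a 
sample of size $n_{2,a}$ is effectively selected from a population 
of size $N_{2,l}$ with the SI design (through an SI sample from 
$U_1$ followed by an SI sample from $s_{1,l}$). Note that a unit k in 
$s_{2,a}$ must be alive since $U_{2,l}$ consists solely of live units. 

Denote the bias of an estimator $\hat{\theta}$ for the parameter $\theta$ by 
$B(\hat{\theta}, \theta)$. Then with respect to the population total 
$t_y = \sum_{U_2} y_k$, the conditional bias of a general linear estimator 
$\hat{t}_y^{(s_{2,a})} = \sum_{s_{2,a}} w_k y_k$ based on $s_{2,a}$, with any given $w_k$'s, is 


$$B(\hat{t}_y^{(s_{2,a})}, t_y \mid n_{1,d}) = \sum_{U_{2,l}} w_k \Pr[k \in s_{2,a} \mid n_{1,d}]\, y_k - \sum_{U_{2,l}} y_k = \sum_{U_{2,l}} \left( \frac{w_k n_{2,a}}{N_{2,l}} - 1 \right) y_k. \qquad (2)$$


For the sample part $s_{2,a}$, the naive expansion estimator 
that ignores feed back bias would have weights $w_k = N_2/n_{2,a}$. 
From (2) the bias of the estimator $\hat{t}_y^{(s_{2,a})} = (N_2/n_{2,a}) \sum_{s_{2,a}} y_k$ 
is 


$$B(\hat{t}_y^{(s_{2,a})}, t_y \mid n_{1,d}) = \left( \frac{N_2}{N_{2,l}} - 1 \right) t_y = \frac{N_{2,d}}{N_{2,l}}\, t_y. \qquad (3)$$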
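
As a small check of (3), the expectation of the expansion estimator over all SI sub-samples of the live units can be enumerated exactly for a toy population. The sketch below is illustrative only: the function name and the toy values are assumptions and are not taken from the study, and the sub-sample is treated as an SI sample from the live units, as argued in the text.

```python
from fractions import Fraction
from itertools import combinations

def expected_expansion_estimate(y_live, N2, n2a):
    """Exact expectation of (N2 / n2a) * sum(y) over all SI samples of
    size n2a drawn from the live units (every sample equally likely)."""
    samples = list(combinations(y_live, n2a))
    total = sum(Fraction(N2, n2a) * sum(sample) for sample in samples)
    return total / len(samples)

# Toy current population: 4 live units and 2 undetected dead units (y = 0).
y_live = [3, 5, 8, 4]
N2l, N2d = len(y_live), 2
N2 = N2l + N2d
t_y = sum(y_live)

relative_bias = expected_expansion_estimate(y_live, N2, n2a=2) / t_y - 1
print(relative_bias, Fraction(N2d, N2l))   # both are 1/2
```

The relative bias equals the ratio of undetected dead units to live units on the current frame, whatever the values of the study variable.
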


2.3 Feed Back Bias from a Sample Taken Afresh 
from the Current Survey Population 


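Next, we derive the bias arising from the sample part $s_{2,b}$ 
of size $n_{2,b}$ taken from $U_2$ through $U_{2,wd}$; see Figure 1. First 
note that 


$$\Pr[k \in s_{2,b} \mid k \in U_{2,wd}, n_{1,d}] = \frac{n_{2,b}}{N_{2,wd}}. \qquad (4)$$


From (4) we obtain that the conditional expected value of 
$\hat{t}_y^{(s_{2,b})} = \sum_{s_{2,b}} w_k y_k$ is 


$$E(\hat{t}_y^{(s_{2,b})} \mid n_{1,d}) = E\left( \sum_{U_{2,wd}} w_k y_k \frac{n_{2,b}}{N_{2,wd}} \,\Big|\, n_{1,d} \right) = \frac{n_{2,b}}{N_{2,wd}} \, \frac{N_{2,l} - n_{1,l}}{N_{2,l}} \sum_{U_2} w_k y_k.$$


The second equation above is due to the fact that, 
given $n_{1,d}$, all $N_{2,l}$ live units in $U_2$ are equally likely to be in 
$U_{2,wd}$, which has $N_{2,l} - n_{1,l}$ live units. Therefore, the 
conditional bias of $\hat{t}_y^{(s_{2,b})}$ is 


$$B(\hat{t}_y^{(s_{2,b})}, t_y \mid n_{1,d}) = \sum_{U_2} y_k \left( \frac{w_k n_{2,b}}{N_{2,wd}} \, \frac{N_{2,l} - n_{1,l}}{N_{2,l}} - 1 \right). \qquad (5)$$


For the expansion estimator $\hat{t}_y^{(s_{2,b})}$ with weights 
$w_k = N_2/n_{2,b}$, the bias is 


$$B(\hat{t}_y^{(s_{2,b})}, t_y \mid n_{1,d}) = B\, t_y, \qquad (6)$$


where 


$$B = \frac{N_2}{N_{2,wd}} \, \frac{N_{2,l} - n_{1,l}}{N_{2,l}} - 1 = -\frac{N_{2,d}\, n_{1,l}}{N_{2,l}\, N_{2,wd}} = -\frac{N_{2,d} (n_1 - n_{1,d})}{N_{2,l} (N_1 - n_1)}.$$


The bias is always non-positive since $B \le 0$. It is easy to 
see that B is an increasing function of $n_{1,d}$ since 
$N_{2,d} = N_{1,d} - n_{1,d}$, where $N_{1,d}$ is the fixed number of all 
dead units in $U_1$. It is also readily seen that the maximum of 
B is attained when $s_{1,d}$ encompasses all dead units in $U_1$, 
that is, when $n_{1,d} = N_{1,d}$ and consequently $N_{2,d} = 0$. 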
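
The algebra behind the alternative expressions for B is short but easy to get wrong, so a sketch of it may help. Writing $N_{2,wd}$ for the size of the nonsampled part of $U_2$ and $n_{1,l}$ for the number of live units in $s_1$, it uses only $N_2 = N_{2,l} + N_{2,d}$, $N_{2,wd} = N_2 - n_{1,l} = N_1 - n_1$ and $n_{1,l} = n_1 - n_{1,d}$:

```latex
\begin{align*}
B &= \frac{N_2 (N_{2,l} - n_{1,l}) - N_{2,wd} N_{2,l}}{N_{2,wd} N_{2,l}} \\
  &= \frac{(N_{2,l} + N_{2,d})(N_{2,l} - n_{1,l})
           - (N_{2,l} + N_{2,d} - n_{1,l}) N_{2,l}}{N_{2,wd} N_{2,l}} \\
  &= -\frac{N_{2,d}\, n_{1,l}}{N_{2,l} N_{2,wd}}
   = -\frac{N_{2,d} (n_1 - n_{1,d})}{N_{2,l} (N_1 - n_1)} .
\end{align*}
```

All terms in the numerator cancel except $-N_{2,d} n_{1,l}$, which is why the bias vanishes when no undetected dead units remain on the frame.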


2.4 Feed Back Bias from Sample Parts Combined 


Combining (6) with (3) we obtain the overall bias of 
$\hat{t}_y = (N_2/n_2) \sum_{s_2} y_k$: 


$$B(\hat{t}_y, t_y \mid n_{1,d}) = E(\hat{t}_y \mid n_{1,d}) - t_y = \frac{N_{2,d}}{N_{2,l}} \left[ \frac{n_{2,a}}{n_2} - \frac{n_{2,b} (n_1 - n_{1,d})}{n_2 (N_1 - n_1)} \right] t_y. \qquad (7)$$


The bias in the expansion estimator is essentially due to not 
knowing the correct population size. In (3) the bias stems 
from multiplying the sample average over live units with $N_2$ 
rather than the unknown $N_{2,l}$. The bias from the sample parts 
$s_{2,a}$ and $s_{2,b}$ will in absolute terms be less than (3) and (6), 
respectively, if some of the dead units in the samples from 
$U_1$ have not been identified as dead and therefore have not 
been weeded out. This would happen, for example, if the 
status of nonresponding units is difficult to determine. 
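
To see how the two sample parts pull in opposite directions, the conditional relative bias can be evaluated numerically. The sketch below assumes that the overall bias is the combination of (3) and (6) weighted by $n_{2,a}/n_2$ and $n_{2,b}/n_2$; the function name and the input values are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def conditional_relative_bias(N1, n1, N1d, n1d, n2a, n2b):
    """Relative feed back bias of the expansion estimator, given n1d.

    N1, n1   : size of the original population and of the first sample
    N1d, n1d : dead units in the original population and in the first sample
    n2a, n2b : second-wave units taken from the old sample and from the
               nonsampled part of the frame
    """
    N2d = N1d - n1d                 # dead units still on the frame
    N2l = N1 - N1d                  # live units (unchanged by feed back)
    n2 = n2a + n2b
    inner = Fraction(n2a, n2) - Fraction(n2b * (n1 - n1d), n2 * (N1 - n1))
    return Fraction(N2d, N2l) * inner

# Hypothetical frame: 1,000 units, 200 dead, 90 of them found by a first sample of 500.
for n2a, n2b in [(100, 0), (0, 100), (50, 50)]:
    print(n2a, n2b, float(conditional_relative_bias(1000, 500, 200, 90, n2a, n2b)))
# prints 0.1375, -0.11275 and 0.012375 for the three splits
```

With the sample split evenly between the two parts, the positive and negative contributions largely offset each other.
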

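An unconditional analysis in the presence of feed back 
can be obtained directly by taking expectation of (7) with 
respect to $n_{1,d}$. Thus, unconditionally, we have 


$$B(\hat{t}_y, t_y) = \frac{t_y}{n_2 N_{2,l}} \left[ n_{2,a} \{N_{1,d} - E(n_{1,d})\} - \frac{n_{2,b}}{N_{2,wd}} \Big( \{N_{1,d} - E(n_{1,d})\} \{n_1 - E(n_{1,d})\} + V(n_{1,d}) \Big) \right], \qquad (8)$$


where $E(n_{1,d}) = n_1 N_{1,d}/N_1$ and $V(n_{1,d}) = n_1 N_{1,d} N_{2,l} (N_1 - n_1) / \{N_1^2 (N_1 - 1)\}$. 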
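
The expectation step can be made explicit. Assuming, as in (7), that the conditional bias depends on $n_{1,d}$ only through $N_{2,d} = N_{1,d} - n_{1,d}$ and $n_1 - n_{1,d}$, with $n_{2,a}$, $n_{2,b}$ and $N_{2,wd} = N_1 - n_1$ held fixed, the two expectations needed are:

```latex
\begin{align*}
E(N_{2,d}) &= N_{1,d} - E(n_{1,d}), \\
E\{N_{2,d}(n_1 - n_{1,d})\}
  &= E\{(N_{1,d} - n_{1,d})(n_1 - n_{1,d})\} \\
  &= N_{1,d}\, n_1 - (N_{1,d} + n_1) E(n_{1,d}) + E(n_{1,d}^2) \\
  &= \{N_{1,d} - E(n_{1,d})\}\{n_1 - E(n_{1,d})\} + V(n_{1,d}).
\end{align*}
```

The variance term is the only part that is not obtained by simply replacing $n_{1,d}$ with its expectation in (7).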

Lavallée (1996) took an interesting approach to a similar 
problem with panel survey data. In that paper, the problem 
of frame update using panels with rotation is addressed 
among other issues. Our approach differs from the approach 
of that paper in that we consider the two conditional 
probabilities $\Pr[k \in s_{2,a} \mid n_{1,d}]$ and $\Pr[k \in s_{2,b} \mid n_{1,d}]$ 
separately. 


3. THREE SIMPLE STRATEGIES AND A 
SIMULATION STUDY 


3.1 Strategies in the Presence of Feed Back 


A strategy, which is referred to as Strategy 1 here, is to 
feed back, delete the set $s_{1,d}$ from the frame and accept the 
feed back bias. However, the size of the bias is seldom 
known. The estimator for Strategy 1 under SI is 
$\hat{t}_{y1} = (N_2/n_2) \sum_{s_2} y_k$, where $s_2$ is a sample taken from $U_2$. To 
obtain Strategy 2, note that if consistent estimates of $N_{2,d}$ 
and $N_{2,l}$ are available these may be plugged into (7) or (8) 
and an estimator with favourable properties is obtained: 


$$\hat{t}_{y2} = \hat{t}_{y1} (1 + \hat{C})^{-1}, \qquad (9)$$

where 

$$\hat{C} = \frac{\hat{N}_{2,d}}{\hat{N}_{2,l}} \left[ \frac{n_{2,a}}{n_2} - \frac{n_{2,b} (n_1 - n_{1,d})}{n_2 (N_1 - n_1)} \right]$$


for both the conditional and unconditional cases since the 
term $n_{2,b} V(n_{1,d}) / (n_2 N_{2,l} N_{2,wd})$ in (8) is almost always neg- 
ligible. The estimates $\hat{N}_{2,d}$ and $\hat{N}_{2,l}$ of the sizes of the 
domains $U_{2,d}$ and $U_{2,l}$ can be obtained from a sample from 
the original or current survey population. If more than one 
sample is drawn, each can provide an unbiased estimate of 
$N_{2,d}$ (or $N_{2,l}$), all of which can be combined. The minimum 
variance combined estimator is the weighted average of the 
estimators, with weights proportional to the reciprocals of 
their variances. As the following argument shows, we do not 
expect the bias of (9) to be large: 


$$E(\hat{t}_{y2}) = E\{\hat{t}_{y1} (1 + \hat{C})^{-1}\} \approx E(\hat{t}_{y1}) (1 + C)^{-1} = t_y (1 + C)(1 + C)^{-1} = t_y,$$


where C denotes $\hat{C}$ evaluated at the true $N_{2,d}$ and $N_{2,l}$, so that 
$E(\hat{t}_{y1}) = t_y (1 + C)$ by (7). 

Another strategy, here denoted by Strategy 3, is to feed 
back the information that certain units are dead, but to retain 
them on the frame and allow them to be sampled. The 
resulting estimator is unbiased, but the disadvantage of this 
strategy is that the precision will suffer as part of the sample 
is lost on ineligible units. The estimator of Strategy 3 is 
$\hat{t}_{y3} = (N_1/n_r) \sum_{r} y_k$, where r is a sample of size $n_r$ from the original 
survey population $U_1$. 
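
The three estimators of this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the estimates of the numbers of dead and live units on the current frame are supplied by the caller, and the correction factor has the form of the bracketed expression in (9).

```python
def expansion_estimate(y_sample, N):
    """Expansion estimator (N / n) * sum of y over the sample."""
    return N / len(y_sample) * sum(y_sample)

def correction_factor(N2d_hat, N2l_hat, N1, n1, n1d, n2a, n2b):
    """Estimated relative feed back bias of the Strategy 1 estimator."""
    n2 = n2a + n2b
    return (N2d_hat / N2l_hat) * (n2a / n2 - n2b * (n1 - n1d) / (n2 * (N1 - n1)))

def strategy1(y_s2, N2):
    """Feed back, delete detected dead units, accept the bias."""
    return expansion_estimate(y_s2, N2)

def strategy2(y_s2, N2, c_hat):
    """Strategy 1 estimate divided by (1 + estimated relative bias)."""
    return strategy1(y_s2, N2) / (1 + c_hat)

def strategy3(y_r, N1):
    """Keep detected dead units on the frame and sample from all of it."""
    return expansion_estimate(y_r, N1)

# Hypothetical example: 1,000 frame units, first sample of 500 found 100 dead,
# so N2 = 900; the second sample of 100 is drawn entirely from the nonsampled part.
y_s2 = [1.0] * 80 + [0.0] * 20          # 20 dead units turn up in the fresh sample
c_hat = correction_factor(N2d_hat=100, N2l_hat=800, N1=1000, n1=500,
                          n1d=100, n2a=0, n2b=100)
print(strategy1(y_s2, 900), strategy2(y_s2, 900, c_hat))
```

In this example 800 live units each have the value 1, so the Strategy 1 estimate of 720 falls short of the true total, while the corrected estimate is close to 800.
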


3.2 A Simulation Study 


A simulation study may shed some light on which of the 
Strategies 1-3 is to be preferred. Natural measures for 
comparing the strategies are bias and variance. In business 
surveys, estimates for subpopulations (industries) are often 
more interesting than the whole population. To simulate a 
subpopulation, a frame consisting of 1,000 units was created 
to form the original survey population. A gamma distributed 
value, Y1, was associated with each unit. We used the same 
gamma distribution as the one that generated Population 12 
in Lee, Rancourt and Särndal (1994, page 236). The coef- 
ficient of variation (population standard deviation divided 
by the mean) was 0.57. Another study variable, Y2, was 
created by performing independent Bernoulli trials, one for 
each population unit, which obtained value 1 with proba- 
bility equal to 0.5 and value 0 otherwise. Unlike in Lee 
et al., some of the units were dead. Each unit was, inde- 
pendently of other units, classified as dead with a probability 
$P_{dead}$. All dead units were assigned zero values for both Y1 
and Y2. A set of Y1 and Y2 were simulated for each of four 
values of $P_{dead}$: 0.03, 0.05, 0.2, and 0.5. These sets contained 
29, 54, 201 and 494 dead units, respectively. 

A PRN was attached to each unit and the units were laid 
out along a PRN line. The first sample, $s_1$, was drawn by 
identifying the 500 units with the smallest PRNs. All dead 
units in $s_1$ were flagged as ‘dead by sample survey sources’. 
Hence, $s_1$ covered approximately the first half of the PRN 
line. The frame with the units flagged as dead by sample 
survey sources excluded made up the current survey 
population. The estimates of $N_{2,d}$ and $N_{2,l}$ used in Strategy 2 
were based on $s_1$. A second sample, denoted by $s_{2current}$, was 
drawn by taking 100 units to the right of a starting point, 
start 2, disregarding units dead by sample survey sources. 
Another sample of 100 units was selected from start 2, but 
units dead by sample survey sources were this time allowed 
to be included in this sample. Hence, this sample was drawn 
from $U_1$, and we denote it by $s_{2orig}$. The sample $s_{2current}$ is 
pertinent to Strategies 1 and 2 while $s_{2orig}$ will be used for 
Strategy 3. 

The procedure described in the preceding paragraph was 
repeated 1,000 times. That is, for each of the values of $P_{dead}$ 
mentioned above and for each of three starting points of $s_2$, 
to be defined, 1,000 sets of PRNs were generated and 
attached to the units. The frame was reordered for each new 
set of PRNs, and three samples were drawn for each 
reordering ($s_1$, $s_{2current}$ and $s_{2orig}$). Two values of start 2, 0.0 
and 0.7, were chosen so as to make the proportion of $s_{2current}$ 
that fell in $s_{1,l}$ 100% and 0%, respectively. That is, $n_{2,a}/n_2$ 
was set to 100% and 0%. Further, to make $n_{2,a}/n_2$ on aver- 
age 50% under each of the chosen $P_{dead}$, appropriate values 
of start 2 were derived. They are 0.448, 0.447, 0.438, and 
0.4 for the $P_{dead}$ values 0.03, 0.05, 0.2, and 0.5, respectively. 

In summary, the population and sample sizes, the study 
variables Y1 and Y2, and which of the units were dead 
were held fixed in our study. For twelve combinations of 
$P_{dead}$ and $n_{2,a}/n_2$, the reordering of the units on the PRN line 
through the simulation of new PRNs made the following 
factors vary: 

- which of the units were included in $s_1$, $s_{2current}$ 
  and $s_{2orig}$; 

- how many and which of the dead units were 
  dead by sample survey sources; 

- which of the units belonged to $s_{1,l}$ and $U_{2,wd}$. 

Thus the quantities $n_{1,d}$, $N_{2,d}$ and $N_2$ vary in the 
simulations. It seems practical to let them do so rather than 
controlling them in an experiment with more factors than 
$P_{dead}$ and $n_{2,a}/n_2$. Hence the results are unconditional, in 
accordance with (8). 
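
The design of the simulation can be sketched compactly in Python. The sketch is not the authors' code and rests on several assumptions that are flagged in the comments: the gamma distribution is parametrised only to give a coefficient of variation of 0.57, the domain sizes used in Strategy 2 are estimated from $s_1$ in a simple way chosen here, sampling wraps around the PRN line, only Y1 is simulated, and 200 replications are used instead of 1,000 to keep the run short.

```python
import random

def simulate_relative_bias(p_dead=0.5, start2=0.7, N1=1000, n1=500, n2=100,
                           reps=200, seed=1):
    """Average relative bias of Strategies 1-3 over `reps` PRN reorderings."""
    rng = random.Random(seed)
    shape = 1 / 0.57 ** 2                       # gamma shape with CV 0.57 (assumed)
    dead = [rng.random() < p_dead for _ in range(N1)]
    y = [0.0 if d else rng.gammavariate(shape, 1.0) for d in dead]
    t_y = sum(y)
    bias = [0.0, 0.0, 0.0]
    for _ in range(reps):
        prn = [rng.random() for _ in range(N1)]
        order = sorted(range(N1), key=prn.__getitem__)
        s1 = set(order[:n1])                    # the n1 smallest PRNs
        flagged = {k for k in s1 if dead[k]}    # dead by sample survey sources
        n1d = len(flagged)
        N2 = N1 - n1d
        # Units to the right of start2, wrapping around the PRN line if needed.
        right = ([k for k in order if prn[k] >= start2]
                 + [k for k in order if prn[k] < start2])
        s2_current = [k for k in right if k not in flagged][:n2]
        s2_orig = right[:n2]
        n2a = sum(k in s1 for k in s2_current)
        n2b = n2 - n2a
        t1 = N2 / n2 * sum(y[k] for k in s2_current)
        # Domain sizes estimated from s1 (an assumed, simple choice).
        N2d_hat = N1 * n1d / n1 - n1d
        N2l_hat = N1 * (n1 - n1d) / n1
        c_hat = (N2d_hat / N2l_hat) * (
            n2a / n2 - n2b * (n1 - n1d) / (n2 * (N1 - n1)))
        t2 = t1 / (1 + c_hat)
        t3 = N1 / n2 * sum(y[k] for k in s2_orig)
        for i, t in enumerate((t1, t2, t3)):
            bias[i] += (t / t_y - 1) / reps
    return bias

b1, b2, b3 = simulate_relative_bias()
print(f"Strategy 1: {b1:+.3f}  Strategy 2: {b2:+.3f}  Strategy 3: {b3:+.3f}")
```

With the defaults (half of the units dead and the second sample drawn entirely from the nonsampled part of the frame) Strategy 1 should show a clearly negative average bias, while Strategies 2 and 3 should stay close to zero.
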


3.3 Results 


Table 1 shows the empirical relative bias of Strategies 1 
and 2, computed as the straight average of the 1,000 
differences between the estimate and the parameter in terms 
of the percentage of the total obtained in the simulation. 
Strategy 3 is unbiased and is therefore not included in Table 
1. The empirical bias of Strategy 3 that nevertheless ap- 
peared in the simulations reflects the simulation error; it was 
at most 0.5%. As seen in Table 1, Strategy 2 is virtually 
unbiased as well. Note that the simulated empirical bias 
under Strategy 1 is what (8) predicts (with allowance for 
simulation error). This bias is appreciable in nearly all cases 
and if the proportion of dead (or ineligible) units is high the 
bias can be very severe indeed. Figure 2 shows the condi- 
tional bias given $n_{1,d}$ for $P_{dead} = 0.50$ and $n_{2,a}/n_2 = 0\%$. 
Note that the bias given by (6) is locally well described by 
the regression line in the figure defined by the OLS fit of the 
bias conditional on $n_{1,d}$. For example, if $n_{1,d} = 220$, then both 
$N_{2,d}/N_{2,l}$ and $(n_1 - n_{1,d})/(N_1 - n_1)$ equal 0.56 and B = 
-0.31. 


Table 1 
Bias, % of Total of Y1. The First Entry in Each Cell is the Bias Under 
Strategy 1, the Second is the Bias Under Strategy 2. 

                       Average of $n_{2,a}/n_2$ 
$P_{dead}$         0%                50%                  100% 
0.03            -1.6   -0.1        0.4          0.4     1.8          0.0 
0.05            -2.8    0.0        0.4          0.4     2.9          0.0 
0.20           -10.2   -0.2        1.5          0.4     [illegible]  0.1 
0.50           -24.6    0.2        [illegible]  0.3     49.0         0.2 


Figure 2. The simulated conditional bias plotted against the number of units dead by 
sample survey sources, $n_{1,d}$, for $P_{dead} = 0.50$ and $n_{2,a}/n_2 = 0\%$. An OLS 
regression line shows the local trend of the conditional bias as a function of $n_{1,d}$. 

To assess the bias it helps to look at the coverage 
probabilities. Table 2 shows the empirical coverage proba- 
bilities, based on symmetric ‘confidence intervals’ with a 
width of two times the simulated empirical standard 
deviation on each side of the point estimate. While Strategy 
2 gives in all cells coverage probabilities close to the 
targeted 95%, Strategy 1 achieves that in general only for 
the population with 3% dead units. The coverage probability 
under Strategy 1 tends also to be acceptable for populations 
with a larger proportion of dead units, if half of the sample 
is taken from the part of the PRN line where dead units have 
been weeded out, and the other half from the part of the 
PRN line where the original proportion of dead units has 
been retained, as the positive bias from the first half of the 
sample tends to cancel out the negative bias from the second 
half. 

The variance of the simulated estimates was computed. 
Tables 3 and 4 show the variance comparisons for Y1 and 
Y2, respectively, under Strategies 2 and 3 relative to that of 
Strategy 1. As expected, in all cases Strategy 1 gave a 
smaller variance than did Strategy 3. Strategy 2 performed 
well in most cases, but considering the extra complexity of 
this strategy, the feed back Strategy 1 seems preferable for 
populations with a small proportion of ineligible units, say 
3% or less. However, if this proportion is larger than, say, 
5%, the bias of Strategy 1 may cause poor coverage proba- 
bilities and misleading estimates. The variance of Strategy 2 
is no worse than that of Strategy 3; in most cases Strategy 2 
is superior. The non-monotone variance ratios in the bottom 
row of Table 3 are due to the estimation of $N_{2,l}$ and $N_{2,d}$, 
combined with the specific details of the simulation. 
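
The two summary measures used in Tables 2 to 4 can be computed from a list of simulated estimates as sketched below. The function names are hypothetical, and the use of the sample standard deviation (divisor n - 1) is an assumption, since the text does not say which divisor was used.

```python
from statistics import stdev, variance

def empirical_coverage(estimates, true_total, k=2.0):
    """Share of replicates whose interval estimate +/- k * sd covers the truth,
    where sd is the empirical standard deviation of the simulated estimates."""
    sd = stdev(estimates)
    hits = sum(abs(e - true_total) <= k * sd for e in estimates)
    return hits / len(estimates)

def variance_ratio(estimates, reference_estimates):
    """Variance of one strategy relative to that of a reference strategy."""
    return variance(estimates) / variance(reference_estimates)

# Toy replicates: an estimator centred on the truth and a badly biased one.
replicates = [8, 9, 10, 11, 12]
print(empirical_coverage(replicates, true_total=10))   # 1.0
print(empirical_coverage(replicates, true_total=14))   # 0.4
```

Because the interval width is based on the simulated standard deviation, a coverage well below 95% signals bias and not a poor variance estimate.
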


Table 2 
The Coverage Probability in Percentage for Estimating Total of Y1. The First Entry 
in Each Cell is the Coverage Probability Under Strategy 1, the Second is the 
Coverage Probability Under Strategy 2. 

                    Average of $n_{2,a}/n_2$ 
$P_{dead}$        0%              50%             100% 
0.03           94.6   94.3     94.6   94.8     94.3   95.1 
0.05           93.3   95.2     94.4   93.9     90.8   95.0 
0.20           65.9   94.5     93.8   94.8     46.1   94.6 
0.50           21.2   95.1     78.4   94.7      0.0   94.8 


Table 3 
Variance Ratio of the Estimator of the Total of Y1. The First Entry in Each Cell 
is the Variance Under Strategy 2 Relative to that of Strategy 1, the 
Second is the Variance Under Strategy 3 Relative to Strategy 1. 

                    Average of $n_{2,a}/n_2$ 
$P_{dead}$        0%              50%             100% 
0.03           1.04   1.04     1.00   1.06     0.98   1.08 
0.05           1.08   1.08     0.98   1.14     0.95   [illegible] 
0.20           1.28   1.28     0.85   1.27     0.83   1.46 
0.50           1.85   1.85     0.52   1.34     0.58   2.24 


Table 4 
Variance Ratio of the Estimator of the Total of Y2. The First Entry in Each Cell 
is the Variance Under Strategy 2 Relative to that of Strategy 1, the 
Second is the Variance Under Strategy 3 Relative to Strategy 1. 

                    Average of $n_{2,a}/n_2$ 
$P_{dead}$        0%                     50%                    100% 
0.03           1.03         1.03      1.00   1.03            0.97   1.03 
0.05           1.06         1.06      0.99   1.04            0.95   1.06 
0.20           [illegible]  1.25      0.92   [illegible]     0.80   [illegible] 
0.50           1.80         1.81      0.65   1.40            0.50   1.36 


4. DISCUSSION 


This paper gives conditional and unconditional expres- 
sions for the feed back bias when the total is estimated with 
the common expansion estimator. We have shown that the 
feed back bias can be large. With as little as 5% ineligible 
units on the frame, feeding back information on these from 
sample surveys can result in about 2-3% bias. However, a 
small-scale simulation study indicates that if the proportion 
of ineligible units is 3% or less, the feed back strategy does 
not seem to create problems in terms of bias and variance. 

We have also derived a virtually unbiased estimator. The 
simulation study shows that this estimator compares 
favourably in terms of variance with the alternative strategy 
of retaining ineligible units on the frame and letting them be 
included in further samples. This estimator relies on the 
availability of consistent estimates of the number of eligible 
and ineligible units in the population. These estimates may 
be obtained from an earlier sample through the unbiased 
strategy of letting units that have been found dead be 
included in the sample. 

In order to facilitate the theoretical development, we have 
made simplifying assumptions. The most important of these 
is the assumption that all dead units have been found in 
earlier sample surveys and have been fed back to the frame. 
We have envisaged a frame with one ‘white’ area, where all 
ineligibles have been flagged as such, and one ‘black’ area, 
where no ineligibles have been touched. In practice, this is 
not likely to happen. If the frame is moderately large and 
used for many continuing surveys, some of which may feed 
back to varying intensity, the frame will turn ‘grey’ rather 
than ‘black and white’. The feed back bias will then be less 
severe than in the ‘black and white’ situation. It has not, 
however, been in the scope of this paper to quantify the bias 
for a ‘realistically grey’ frame. In this sense, what has been 
examined in this paper is a worst case scenario. 


Application of Quality Control in ICR Data Capture: 
2001 Canadian Census of Agriculture 


WALTER MUDRYK and HANSHENG XIE 


ABSTRACT 


Intelligent Character Recognition (ICR) has been widely used as a new technology in data capture processing. It was used 
for the first time at Statistics Canada to process the 2001 Canadian Census of Agriculture. This involved many new 
challenges, both operational and methodological. This paper presents an overview of the methodological tools used to put in 
place an efficient ICR system. Since the potential for high levels of error existed at various stages of the operation, Quality 
Assurance (QA) and Quality Control (QC) methods and procedures were built into this operation to ensure a high degree of 
accuracy in the captured data. This paper describes these QA / QC methods along with their results and shows how quality 
improvements were achieved in the ICR Data Capture operation. This paper also identifies the positive impacts of these 
procedures on this operation. 


KEY WORDS: Data Capture; Intelligent Character Recognition (ICR); Quality control; Quality improvement; 
Statistical process control. 


1. INTRODUCTION 


The data capture of the 2001 Canadian Census of 
Agriculture was conducted between July and November 
2001, using relatively new technology called Intelligent 
Character Recognition (ICR). This approach to data capture 
combines Automated Machine Capture which uses optical 
character, mark and image recognition, with Manual 
Capture by operators who ‘key from image’ using a heads- 
up data capture technique. The heads-up data capture 
technique is applied only to fields that cannot be recognized 
by the optical system with a pre-specified, sufficiently high 
degree of confidence. 

The ICR system offered many benefits to the data 
capture operation, in terms of resource savings and 
productivity gains. At the same time, accuracy became an 
extremely important consideration for processing a large 
number of documents since the potential for unacceptable 
levels of error existed at various stages of the process. In the 
literature, the quality of ICR applications has been studied 
by a few authors; see, e.g., Kalpic (1994) and Pasley (2000), 
among others. Kalpic discussed the coding algorithm and 
the results for the 1991 Census Coding Operation in Croatia 
and Bosnia-Herzegovina, using intelligent optical readers. 
Pasley pointed out that the quality of a scanned image 
usually depends on the quality of the source document, the 
precision of the scanner, the skill of the scanner operator and 
the resolution at which the document was scanned. With 
quality improvement in mind, QA and QC procedures were 
built into the data capture operation for the 2001 Canadian 
Census of Agriculture to ensure a high degree of accuracy in 
this operation. 


Quality Control activities for the ICR Data Capture 
Operation were focused on three main stages of processing, 
namely: document preparation, scanning calibration, and 
data capture of the questionnaires. This was done since 
these stages were dependent on one another and each had 
the potential to contribute significant errors down the line. 
Therefore, each component should ideally have its own 
control system. 

It is the purpose of this paper to describe the QA/QC 
methodology and procedures associated with each of the 
main stages of the ICR Data Capture Operation, summarise 
the results obtained from their application and show how 
ongoing quality improvements were achieved in the ICR 
Data Capture operation. 


2. QUALITY PROGRAM OVERVIEW 


To better understand the rationale behind the QA/QC 
procedures, it is worthwhile to give an overview of their 
objectives and methodologies. 


2.1 Objectives 


The overall quality objective for this project was to 
measure, control and improve the quality of the entire ICR 
Data Capture Operation on a continuous basis. This would 
be achieved by implementing a series of QA/QC procedures 
at each critical stage of the operation. The specific 
objectives for each stage were as follows: 


a) Document Preparation: to ensure that only highly 
readable documents would reach the scanning stage. 

b) Scanning Calibration: to ensure optimal machine set-up 
and calibration prior to the start of production. 

c) Quick Capture (Machine Capture) and Quick Key 
(Manual Capture): to ensure a high level of quality of 
data capture during production. 


2.2 QA/QC Methodologies 


Each major stage of processing was operationally unique 
and therefore, had different quality requirements. As a 
result, QA procedures were applied to the Document 
Preparation operation, and QC procedures to the Scanning 
Calibration, Quick Capture and Quick Key operations. A 
flowchart is given in the Appendix, which shows the various 
stages of the ICR Data Capture Operation and exactly where 
these procedures were applied. 


2.2.1 Document Preparation 


The document preparation operation was essentially 
divided into five sub-processes, specifically: sorting, 
transcription, batching, cutting and storage. This operation 
was responsible for preparing the questionnaires and as- 
sociated batches for scanning by the ICR equipment and 
was performed manually by clerical staff. It included 
activities such as separating the contents of the received 
envelopes by document type (Sorting), re-transcribing dam- 
aged or illegible questionnaires (Transcription), grouping 
questionnaires into batches for registration (Batching), 
cutting the spine of each booklet questionnaire with an 
electric cutter (Cutting) and filing questionnaires in the 
archive (Storage). One of the most important aspects of this 
operation was the identification and isolation of problematic 
questionnaires so that they would not advance undetected to 
the scanning and data capture stages. These problematic 
questionnaires were labeled as ‘outlier’ questionnaires since 
they had problems such as X’ed out or written-over fields, 
extraneous markings, illegible entries, and 
torn, crumpled or taped documents. 

The potential for error in this operation could lead to 
some problems being experienced at the scanning stage. It 
was felt that QA procedures would be appropriate to ensure 
quality at this stage since many of the clerical functions 
were also subject to various automated system cross-checks. 
The system cross checks ensured that the documents had a 
valid ID, correct number of pages, and that the pages, once 
cut, were aligned and in sequential order. The QA 
procedures consisted of a series of on-going random spot 
checks for each of the five sub-processes. The results of 
each spot check were recorded on a control form and 
summarized for the supervisor to identify if the work was 
being done correctly. Feedback would then be given to the 
individual clerk or group on a regular basis, and corrective 
actions would be taken when necessary. For example, if the 
work was not being performed well, some re-training would 
take place and/or the frequency of spot checks would be 
increased until favorable results were obtained. If 
extensive problems were identified, the supervisor could 
also decide on the amount of re-work required, based on the 
seriousness of the problem observed. 

For the sorting, batching, cutting and storage operations, 
the quality measure selected was ‘percent of questionnaires 
in error’ (i.e., in keeping with the assumptions required for a 
simple sampling unit). For the transcription operation, the 
probability of multiple independent errors occurring within 
a questionnaire was extremely high and therefore the quality 
measure selected was ‘Defects per Hundred Units, DPHU’ 
(i.e., in keeping with the assumptions required for a com- 
plex sampling unit). 
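
As an illustration only, the two quality measures can be contrasted on a small set of spot-check results. The defect counts below are hypothetical and the function names are not taken from the QA system described here; the sketch simply shows that the first measure counts questionnaires with at least one defect, while DPHU counts every defect.

```python
def percent_in_error(defect_counts):
    """Percent of sampled questionnaires with at least one defect."""
    in_error = sum(1 for d in defect_counts if d > 0)
    return 100.0 * in_error / len(defect_counts)


def defects_per_hundred_units(defect_counts):
    """Total number of defects per 100 sampled questionnaires (DPHU)."""
    return 100.0 * sum(defect_counts) / len(defect_counts)


# Hypothetical spot check: defects found in each of 10 questionnaires.
sample = [0, 0, 3, 0, 1, 0, 0, 2, 0, 0]

print(percent_in_error(sample))           # questionnaire-level measure
print(defects_per_hundred_units(sample))  # defect-level measure
```

Because a single questionnaire can carry several defects, DPHU can exceed 100, whereas the percent in error cannot.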


2.2.2 Scanning Calibration Check 


Experience has shown that if the scanning equipment is 
not properly configured, the potential for generating poor 
quality images increases substantially. It is therefore im- 
perative that the scanning equipment be optimally set prior 
to production and well maintained throughout the scanning 
operation. To ensure this, a QC procedure called the 
Scanning Calibration Check was developed to review the 
machine settings and calibration on an ongoing basis. 

Since the equipment settings of the scanning system 
would tend not to fluctuate too greatly, it was felt that 
Statistical Process Control (SPC) methods would be 
appropriate for controlling this portion of the operation. This 
would essentially be an ongoing spot check of the 
calibration settings performed on a daily basis prior to the 
start of production. The calibration check consisted of re- 
scanning a test batch and comparing the results with the 
corresponding pre-benchmarked results for the same batch. 
The differences between the actual and expected results 
would be compared and error rates computed. These error 
rates were then plotted on SPC control charts to determine if 
the process was operating at an acceptable level. If this test 
batch failed, the scanning process would not be allowed to 
start production until the machine was re-calibrated and 
subsequently re-tested successfully. 

In the Scanning operation, machine recognition could 
substitute wrong values when poor quality images are 
produced. Poor images could be the result of many factors 
such as dirty read heads, smeared optical windows, mis- 
alignment, mis-registration of fields, poor contrast / 
brightness levels, paper feed problems, etc. Since a specific 
quality standard was established for each field type, a 
separate p control chart was used to evaluate the substitution 
error rate for each type (specifically, alpha, alphanumeric, 
numeric, tick boxes and bar codes). The acceptable quality 
standard for each field type had previously been established by 
the client area; the quality measure used was therefore 
‘percent of fields in error’, i.e., the 
substitution error rate by field type for each scanner. 

Based on SPC control chart theory, a decision for each 
scanning calibration test was made as follows: 


— If each of the sample error rates for the five field types 
was respectively lower than their corresponding upper 
control limit (UCL), it was concluded that the scanning 
system was functioning properly and was ready for 
scanning production. 

— Otherwise, it was concluded that a problem existed with 
the scanning equipment, and corrective action must be 
taken before the start of regular production. 


The test batches were constructed with minimum sample 
size requirements in mind for each field type, such that the 
producer’s confidence level would be at least 95%. This was 
then used as a guide in selecting the actual questionnaires 
for each of the test batches. The minimum size was required 
for each field type in order to achieve the high efficiency of 
decisions in the scanning calibration test, while the 
Producer’s Confidence Level referred to the likelihood that 
the scanning system would pass the test for that field type 
when the system was functioning at the acceptable target 
level. The Upper Control Limit for each field type was 
computed assuming a +2σ variability. This limit is lower 
than the customary +3σ Upper Control Limit since the 
scanning calibration check was designed to be more 
sensitive in detecting smaller shifts at start-up than during 
normal production. 
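
For reference, a limit of this kind has the usual p chart form. The expression below is the standard textbook one, not a formula quoted from the operation's documentation: $p_0$ denotes the target error rate for a field type, $n$ the number of fields of that type in the test batch, and $k$ the multiplier (2 for the calibration check, 3 for a customary chart). The values $p_0 = 0.02$ and $n = 1000$ are assumed purely for illustration.

```latex
\[
\mathrm{UCL}_k = p_0 + k \sqrt{\frac{p_0 (1 - p_0)}{n}}
\]
\[
p_0 = 0.02,\; n = 1000:\qquad
\mathrm{UCL}_2 \approx 0.02 + 2(0.00443) \approx 0.0289,\qquad
\mathrm{UCL}_3 \approx 0.02 + 3(0.00443) \approx 0.0333
\]
```

Under these assumed values, a test batch with a 3.0% substitution error rate would fail the calibration check even though it would still fall inside a customary chart limit.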


2.2.3 Quick Capture and Quick Key 


Once the questionnaires had been scanned, the system 
would produce a digital image of each field along with an 
interpretation of its value and an associated confidence level 
for its recognition. The actual data capture then consisted of 
two processes: Quick Capture and Quick Key. Quick 
Capture was the automatic recognition by the system of all 
field images whose confidence levels were above a pre- 
specified threshold value. Quick Key consisted of the heads- 
up manual capture (by keyers working on terminals) of field 
images whose confidence levels were below the pre-set 
threshold value. 
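
The split between the two processes can be sketched as follows. This is a simplified illustration rather than the production logic: the `Field` record, the field names, the threshold of 0.90 and the treatment of a confidence exactly equal to the threshold are all assumptions, since the text does not specify them.

```python
from dataclasses import dataclass


@dataclass
class Field:
    name: str          # hypothetical field identifier
    value: str         # value interpreted by the recognition engine
    confidence: float  # recognition confidence, assumed to lie in [0, 1]


def route(fields, threshold=0.90):
    """Split fields into Quick Capture (accepted) and Quick Key (to be keyed).

    Fields at or above the threshold are accepted automatically; this
    handling of the boundary case is an assumption.
    """
    quick_capture = [f for f in fields if f.confidence >= threshold]
    quick_key = [f for f in fields if f.confidence < threshold]
    return quick_capture, quick_key


fields = [
    Field("acres_wheat", "120", 0.98),
    Field("num_cattle", "4S", 0.41),
    Field("operator_name", "SMITH", 0.93),
]
captured, to_key = route(fields)
print([f.name for f in captured])
print([f.name for f in to_key])
```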

Since under ideal circumstances, these two processes 
were expected to be relatively stable, the QC Procedures 
were again based on SPC principles and were developed to 
measure and monitor the quality of each of the processes. 
This QC approach consisted of a small sample check from 
the output of a sample of batches taken systematically over 
time and computing the error rates for each sample. These 
error rates would then be compared to rejection levels that 
were calculated by the system based on the expected quality 
standard and the sample size for that observation. A 
decision was then made as to the acceptability of each of 
these sample measurements relative to the expected quality 
standard for that process. 

In the case of the Quick Capture operation, the machine 
may interpret a different value from the actual value for that 
field, and therefore, substitution rates were used to evaluate 
this process. These substitution errors are particularly 
serious since, if left unchecked, they may affect the 
recognition rate for many fields for a long period of time. In 
the case of the Quick Key operation, operators may make 
keying errors for many reasons such as lack of skill, poor 
training, fatigue, etc., and therefore, keying error rates were 
used to evaluate this manual process. For both of these 
processes, the quality measure was defined as ‘percent of 
fields in error’, across all field types combined. 

Within the two capture operations, there were two 
distinct categories for processing the scanned documents: 
Regular questionnaires and Outlier questionnaires. QC 
procedures were put in place for each category. A separate 
sample was required for each process, one for Quick 
Capture and one for Quick Key. The system could 
distinguish between Quick Capture and Quick Key fields in 
each sample questionnaire and maintain separate counts of 
these fields that had been captured under each process. 
These field counts eventually became the sample size for 
each sample. Each sample was then compared to its own 
threshold rejection rate, which was a function of the 
number of fields observed (i.e., the effective sample size) 
and the expected quality standard or target for that process. 
A decision would then be made to accept or reject the 
sample. The threshold rejection rate was equivalent to the 
standard Upper Control Limit (UCL) that would be 
calculated on a standard p control chart. If the sample error 
rate exceeded this level, the process was rejected and the 
QC Reviewer proceeded to investigate and implement 
corrective actions as appropriate; otherwise the process was 
accepted. 

The sampling was done on an individual scanner basis 
for Quick Capture and an individual operator basis for 
Quick Key. Some operators required more questionnaires to 
be sampled from time to time, and others less, based on their 
actual performance. Since the actual observations were 
based on samples, a customary +3σ variability was 
permitted above the expected quality standard (i.e., the 
centerline of a p control chart) for each process. The batch 
decisions for these sample observations were made by the 
system during QC verification and these results were then 
plotted on a p control chart for each scanner and operator, 
after the fact and updated weekly. 
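
A minimal sketch of such a sample decision is given below, assuming the textbook p chart limit (target plus three standard errors of a proportion). The function names, the 2% target and the sample counts are hypothetical, and the actual system may have computed its rejection levels differently.

```python
import math


def threshold_rejection_rate(target, n_fields, k=3.0):
    """Upper control limit of a p chart: target + k standard errors."""
    sigma = math.sqrt(target * (1.0 - target) / n_fields)
    return target + k * sigma


def qc_decision(errors, n_fields, target, k=3.0):
    """Return 'accept' or 'reject' for one QC sample."""
    sample_rate = errors / n_fields
    limit = threshold_rejection_rate(target, n_fields, k)
    return "reject" if sample_rate > limit else "accept"


# Hypothetical samples, each checked against an assumed 2% target.
print(qc_decision(errors=9, n_fields=400, target=0.02))
print(qc_decision(errors=20, n_fields=400, target=0.02))

# The limit tightens toward the target as more fields are observed.
print(round(threshold_rejection_rate(0.02, 400), 4))
print(round(threshold_rejection_rate(0.02, 4000), 4))
```

Because the limit depends on the number of fields actually observed, two samples with the same error rate can lead to different decisions when their effective sample sizes differ.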

For a detailed description of these QA/QC procedures 
and their rationale, please refer to Mudryk, Bougie and Xie 
(2001). 


3. QUALITY IMPROVEMENTS 


Two essential elements were included in the quality 
improvement strategy for the ICR Data Capture Operation. 
These consisted of feedback of QA/QC results and the 
implementation of corrective and preventive actions when 
required. These two elements enabled various staff to play 
an active role in improving the quality of each process 
through the additional insight into the problems that were 
identified and through the subsequent corrective or 
preventive actions that were taken. 

Using QC data analysis as the base, all processes were 
examined to determine if they were operating efficiently. 
QC meetings were held with operations staff on a weekly 
basis to review the ongoing progress of the entire operation. 
Problems that had impacted any of the processes were 
addressed and recommendations made to treat their root 
causes and prevent their re-occurrence. The involvement of 
operational staff in resolving these problems played an 
important part in facilitating quality improvements on a 
continuous basis. The following examples illustrate some of 
the more significant corrective actions that were taken 
during the operation that led to quality improvements at 
various stages. 


Example 1: Filtering Process for Detecting Outlier 
Documents 

During the first few weeks of production, it was noticed 
that some documents were causing a high concentration of 
errors because of features such as large X’s across a page, 0’s and 
dashes in various fields, etc. These documents were causing 
high error rates for both operations but especially for the 
Quick Capture process. Since these documents were very 
different from the majority of the regular documents, a 
procedure was introduced to sort these documents for spe- 
cial treatment and processing after the fact. Some docu- 
ments in fact had to be re-transcribed at this stage prior to 
processing them by ICR. 


Example 2: Adjusting System Settings for Scanning & 
Recognition 

The highlights of the QC weekly summaries indicated 
that both scanners made errors frequently on Pages 3 and 14 
of the questionnaires during the first few weeks of 
processing. An investigation was conducted and it was 
found that there was a template reading problem on Page 3 
and the pre-set recognition threshold level for the numeric 
fields on Page 14 was set too low. After the system settings 
on both scanners were adjusted, the system showed 
substantial improvements in the scanning of these two 
pages. 


Example 3: Retraining Operators with High Error Rates 

During the keying operation, the QC results showed that 
certain keyers were experiencing above average difficulties 
with the ‘key from image’ process and that their error rates 
remained high for several weeks. In keeping with the focus on 
continuous improvement, these keyers were offered retraining on an 
ongoing basis. As a result, many keyers made significant 
improvements (week by week) in their keying performance. 


4. QC EVALUATION AND ANALYSIS 


Throughout the operation, many QC reports, charts and 
estimates were produced to provide information about the 
incoming and outgoing quality levels and to evaluate the 
output of each production process. These reports were used 
to analyse the quality of each process by week and across 
weeks. 


4.1 Document Preparation 


For each of the five sub-processes of the document 
preparation, individual QA procedures were applied at 
different frequencies and both corrective and preventive 
actions were taken on an on-going basis as dictated by the 
results. The information collected and the feedback that was 
provided as a result of these QA procedures helped 
significantly in improving the scanning, imaging, recog- 
nition and capture of the questionnaires. In the first few 
weeks of production, it was discovered from the QC results 
that problematic documents (i.e., outliers) were causing 
most of the substitution errors (i.e., machine errors) in the 
Quick Capture process. From that point on, a new procedure 
was introduced into the Sorting process of the Document 
Preparation operation to separate these documents from the 
regular documents for special treatment (i.e., they were 
labeled for subsequent 100% verification). In general, better 
quality documents reached the scanning stations while 
poorer documents were either re-transcribed or processed 
separately with the addition of post processes such as 100% 
verification. 


4.2 Scanning Calibration Check 


In an effort to ensure optimal scanner settings and 
calibration, a Scanning Calibration Check was initially 
conducted twice a day, and subsequently once a day, prior 
to production processing. Many test batches were scanned 
during the operation with a relatively high rejection rate 
encountered by each scanner. On average, approximately 
2-3 tests per day (with corresponding re-calibrations) were 
required for optimising the set-up of each of the two 
scanners. This demonstrates the need for re-calibration 
between processing periods. It should be noted that some 
rejections occurred due to problems identified with the test 
batches which were fixed later on. This is definitely an area 
where some procedural improvement is required in the 
future. 

Both scanners exhibited reasonably high variability 
during this test. The high number of tests required, high rate 
of rejection and high variability across processing periods 
for many of the field types demonstrate the need to calibrate 
the scanning equipment properly prior to production. 
Otherwise, the scanners could be inadvertently set up to 
produce poor images right from the start, which would 
make good quality capture very difficult. Once a test batch 
failed, problems were usually identified and subsequent 
maintenance and corrective actions taken. This included 
actions such as: re-configuring the scanning equipment, 
replacing old light bulbs, fixing software problems, cleaning 
dirty read heads, etc. Using this test, the scanners could be 
calibrated and maintained at optimum levels of 
performance between production runs. 


4.3 Quick Capture and Quick Key 


For the Quick Capture process, over the entire 18 weeks 
of processing the Regular questionnaires, the overall 
weekly substitution error rates decreased steadily from 4.3% 
to 0.8%, resulting in a grand overall substitution error rate of 
2.0% (across all field types) for both scanners. The 
substitution error rates measured during production were 
maintained very near the Target levels that were established 
for each field type. These were as follows: Alpha (2.1% 
relative to a target of 2.0%); Alphanumeric (3.2% vs. 3.5%); 
Bar Code (0.0% vs. 0.2%); Numeric (2.8% vs. 2.0%) and 
Tick Boxes (0.8% vs. 0.4%). In comparison, processing the 
outlier questionnaires had a much higher substitution error 
rate and greater weekly variability than the corresponding 
regular questionnaires (i.e., ranged from a high of 22.4% to 
a low of 1.3%). Although the substitution error rate did tend 
to reduce substantially over time, it did remain relatively 
high throughout the process and was measured at 7.0% 
overall, which was significantly higher than the rate for 
regular questionnaires (i.e., 2.0%). 

For the Quick Key process, the keying error rate for 
processing the regular questionnaires was relatively high 
throughout the entire processing period (i.e., mostly over 
3%). This was partially due to the fact that this operation 
was a heads-up keying process and these keyers typically 
processed the most difficult cases. Over the entire 18 weeks 
however, the weekly keying error rates generally decreased 
from 5.6% to 1.6%, with an overall average of 3.4%. The 
keying was also subject to high levels of variability among 
operators, with individual error rates ranging from 1.7% to 7.5%. 
It is interesting that the keying error rate for the outlier 
questionnaires was similar to that for the corresponding regular 
process (i.e., 3.7% vs. 3.4%) and ranged from a high of 
5.7% to a low of 1.6%. 


4.4 Estimates of Average Outgoing Quality 


The primary purpose of the QA/QC procedures was to 
identify problems and to prevent them from occurring again. 
However, these procedures also had a corrective component 
in the sense that errors that were discovered were always 
rectified. It is therefore possible to estimate the overall 
Average Outgoing Quality (AOQ) for the data capture 
component after the application of the QC procedures. 

Estimates of AOQ were calculated for each of the two 
data capture processes. For a sampled outlier batch, all the 
questionnaires (i.e., sampled and remainder) in that batch 
would be subjected to subsequent 100% verification, while 
for a regular batch, only the sampled questionnaires would 
be verified. This affects the calculation of AOQ since it can 
be assumed that the outgoing error rate for all verified 
questionnaires is 0.0%. The overall estimate for each 
component was based on the information obtained from 
both the regular and outlier documents, considering 
estimates of incoming quality and corrections made during 
verification. In the calculation, any documents reprocessed 
through either Quick Capture or Quick Key were included 
in the count. 

Table 1 provides estimates of the AOQ for the Quick 
Capture and Quick Key processes. 


Table 1 
Estimates of AOQ for ICR Data Capture 

Process          No. Questionnaires   No. Fields in   No. Fields Verified   Estimated Incoming   AOQ (%)
                 in Population        Population      and Corrected         Error (%)
Quick Capture
  Regular              273,818         21,248,277           170,249              2.01              1.99
  Outlier               12,702          1,044,358         1,044,358              6.99              0.00
  Overall              286,520         22,292,635         1,214,607              2.95              1.90
Quick Key
  Regular              281,502          6,376,020           234,253              3.41              3.28
  Outlier               25,788            686,734           686,734              3.67              0.00
  Overall              307,290          7,062,754           920,987              3.45              2.97
Combined
  Regular                              27,624,297           404,502           [illegible]          2.29
  Outlier                               1,731,092         1,731,092              5.09              0.00
  Overall                              29,355,389         2,135,594              3.24              2.16

It can be seen that the overall AOQ for the Quick 
Capture process was estimated at 1.90% and for the Quick 
Key process at 2.97%. These were down considerably from 
the corresponding estimates of incoming error of 2.95% 
and 3.45% respectively. The overall AOQ for both 
processes combined was estimated at 2.16% (relative to an overall 
incoming error rate of 3.24%). It should be noted that the 
AOQ for outlier documents was assumed to be 0% since all 
outlier documents were subsequently 100% verified. 
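
The arithmetic behind the AOQ figures can be sketched as follows. This is one reading that is consistent with the Quick Capture and Quick Key rows of Table 1, not necessarily the exact estimator used by the authors: verified fields are assumed to leave the process error-free, and the overall AOQ is taken as a field-weighted average of the regular and outlier AOQs. The function names are hypothetical.

```python
def aoq(incoming_pct, fields_verified, fields_population):
    """AOQ (%) assuming verified fields leave the process error-free."""
    unverified_share = 1.0 - fields_verified / fields_population
    return incoming_pct * unverified_share


def overall_aoq(parts):
    """Field-weighted average AOQ over (aoq_pct, fields_population) pairs."""
    total_fields = sum(n for _, n in parts)
    return sum(a * n for a, n in parts) / total_fields


# Figures from Table 1 (Quick Capture).
qc_regular = aoq(2.01, 170_249, 21_248_277)
qc_outlier = aoq(6.99, 1_044_358, 1_044_358)
qc_overall = overall_aoq([(qc_regular, 21_248_277), (qc_outlier, 1_044_358)])

print(round(qc_regular, 2), round(qc_outlier, 2), round(qc_overall, 2))
```

With the Table 1 inputs, this reproduces the published Quick Capture values of 1.99%, 0.00% and 1.90%. The same calculation applied to the Quick Key rows gives 3.28% for regular questionnaires.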


4.5 QC Summary 


The above results clearly indicate the need for the 
QA/QC procedures at the different stages of processing. They 
also show how these procedures collectively contributed to controlling 
the outgoing quality and to generating quality improvements 
in all phases of the ICR data capture operation. 

The QC results clearly showed that the outlier documents 
had a greater negative impact on the Quick Capture process 
(i.e., 7.0% substitution error rate) than the Quick Key 
process (i.e., 3.7% keying error rate). This indicates that the 
filtering process for special treatment of outlier documents 
was an important step to take. The QC results also showed 
that if the documents were in good shape for scanning and 
the machines were well calibrated, the automated system 
was capable of capturing the data faster and with better 
quality than the manual key from image process. This is 
quite an important observation, since there are obvious 
savings implied with a corresponding improvement in data 
capture quality (i.e., 2.0% vs. 3.4%). In defence of the 
keyers, however, they did process the more difficult cases, 
thus partially explaining their higher error rates. Overall, it 
was estimated that about 77% of the fields were captured 
through the Quick Capture process and 23% were captured 
through the Quick Key process. 

It should also be noted that the regular feedback of the 
QC information collected from the various stages of the ICR 
process was essential in identifying the root causes of many 
problems and in helping to resolve them. This made it 
possible to introduce many quality improvements 
at the various stages on an ongoing basis. 

For a detailed description of these QA/QC results, please 
refer to Mudryk and Xie (2002). 


5. CONCLUSIONS 


It is clear from the results obtained in this analysis that 
the QA/QC procedures were extremely valuable and had a 
very positive impact on the entire operation. The QA 
procedures that were applied in the Document Preparation 
process were effective in preventing many poor documents 
from reaching the scanning stations and those that did were 
then labeled for special treatment and subsequent 100% 
verification. 

The QC procedures were then able to optimize the 
machine set-up by applying the Scanning Calibration Check 
prior to production. Furthermore, during production, QC 
samples were also able to identify problems with the automatic 
recognition and key from image processes, so that 
they could be improved as required. 

In all cases, early warning signals were obtained from 
objective measurements at each stage of processing, and 
corrective and preventive actions were implemented as 
needed. Extensive feedback was provided to all stages of the 
ICR process on an ongoing basis from which continuous 
quality improvements were generated. 


APPENDIX 


ICR Data Capture Operation (with QA/QC) 
[Flowchart not reproduced.] 


Design Effects for the Weighted Mean and Total Estimators 
Under Complex Survey Sampling 


INHO PARK and HYUNSHIK LEE 


ABSTRACT 


We revisit the relationship between the design effects for the weighted total estimator and the weighted mean estimator 
under complex survey sampling. Examples are provided under various cases. Furthermore, some of the misconceptions 
surrounding design effects will be clarified with examples. 


KEY WORDS: Simple random sample; pps sampling; Multistage sampling; Self-weighting; Poststratification; 
Intracluster correlation coefficient. 


1. INTRODUCTION 


The design effect is widely used in survey sampling for 
developing a sampling design and for reporting the effect of 
the sampling design in estimation and analysis. It is defined 
as the ratio of the variance of an estimator under a complex 
sampling design to that of the estimator under simple 
random sampling with the same sample size. An estimated 
design effect is routinely produced by computer software 
packages for complex surveys such as WesVar and 
SUDAAN. It was originally intended and defined for the 
weighted (ratio) estimator of the population mean (Kish 
1995). However, a common practice has been to apply this 
concept to other statistics such as the weighted total 
estimator, often with success but at times with confusion and 
misunderstanding. The latter situation occurs particularly 
when simple but useful results derived under a relatively 
simple sampling design are applied to more complex 
problems. In this paper, we examine the relationship 
between the design effects for the weighted total estimator 
and the weighted mean estimator under various complex 
survey sampling designs. In section 2, we briefly review the 
definition of the design effect and its practical usage while 
discussing some of the misconceptions surrounding design 
effects for the weighted total and mean estimators. 
Subsequently, in section 3, we analyze the difference 
between the design effect for the weighted total estimator 
and that for the weighted mean estimator under a two-stage 
sampling design, followed by a discussion regarding the 
design effects under various two-stage sampling designs and 
some more general cases in section 4. We try to clarify 
some of the misconceptions with these examples. Finally, 
we summarize our discussion in section 5. 


2. A BRIEF REVIEW ON DEFINITION AND USE 
OF DESIGN EFFECT IN PRACTICE 


A precursor of the design effect that has been 
popularized by Kish (1965) was used by Cornfield (1951). 
He defined the efficiency of a complex sampling design for 
estimating a population proportion as the ratio of the 
variance of the proportion estimator under simple random 
sampling with replacement (srswr) to the corresponding 
variance under a simple random cluster sampling design 
with the same sample size. The inverse of the ratio defined 
by Cornfield (1951) was also used by others. For example, 
Hansen, Hurwitz and Madow (1953, Vol. I, pages 
259-270) discussed the increase of the relative variance 
of a ratio estimator due to the clustering effect of cluster 
sampling over simple random sampling without 
replacement (srswor). The name, design effect, or Deff in 
short, however, was coined and defined formally by Kish 
(1965, section 8.2, page 258) as “the ratio of the actual 
variance of a sample to the variance of a simple random 
sample of the same number of elements” (for more history, 
see also Kish 1995, page 73 and references cited therein). 

Suppose that we are interested in estimating the 
population mean ($\bar{Y}$) of a variable $y$ from a sample of 
size $m$ drawn by a complex sampling design denoted by 
$p$ from a population of size $M$. Kish’s Deff for an 
estimate ($\bar{y}_p$) is given by 

\[
\mathrm{Deff} = \frac{V_p(\bar{y}_p)}{(1 - f) S_y^2 / m}
\]

where $V_p$ denotes variance with respect to $p$, $f = m/M$ is 
the overall sampling fraction, and 
$S_y^2 = (M - 1)^{-1} \sum_{k=1}^{M} (y_k - \bar{Y})^2$ is the population element variance of the 
$y$-variable. Although the design effect was originally 
intended and defined for an estimator of the population 
mean (Kish 1995), it can be defined for any meaningful 
statistic computed from a sample selected by a complex 
sampling design. 

The Deff is a population quantity that depends on the 
sampling design and refers to a particular statistic estimating 
a particular population parameter of interest. Different 
estimators can estimate the same parameter and their design 
effects are different even under the same design. Therefore, 
the design effect includes not only the efficiency of the 
design but also the efficiency of the estimator. Särndal, 
Swensson, and Wretman (1992, page 54) made this point 
clear by defining it as a function of the design ($p$) and the 
estimator ($\hat{\theta}$) for the population parameter ($\theta = \theta(y)$). 
Thus, we may write it as 

\[
\mathrm{Deff}_p(\hat{\theta}) = \frac{V_p(\hat{\theta})}{V_{\mathrm{srswor}}(\hat{\theta}')}
\]

where $\hat{\theta}'$ is the usual form of an estimator for $\theta$ under 
srswor, which is normally different from $\hat{\theta}$. For example, to 
estimate the population mean, one may use the weighted 
(ratio) mean $\hat{\theta} = \sum_s w_k y_k / \sum_s w_k$ with sampling weights 
$w_k$, but $\hat{\theta}'$ would be the simple sample mean $\sum_s y_k / m$, 
where the summation is over the sample $s$. We will see the 
effect of particular estimators $\hat{\theta}$ on the design effect in the 
later sections. 

Kish (1995) later advocated using a somewhat different 
definition, which is called Deft and uses the srswr variance 
in the denominator on the ground that without-replacement 
sampling is a part of the design and should be captured in 
the definition. He also reasoned that Deft is easier to use for 
making inferences and that it is better to define the design 
effect without the finite population correction factor $(1 - f)$ 
because the factor is difficult to compute in some situations. 
The new definition is given by 

\[
\mathrm{Deft}_p(\hat{\theta}) = \sqrt{\frac{V_p(\hat{\theta})}{V_{\mathrm{srswr}}(\hat{\theta}')}}
\]

or $\mathrm{Deft}_p^2(\hat{\theta}) = V_p(\hat{\theta}) / V_{\mathrm{srswr}}(\hat{\theta}')$. Survey data software 
such as WesVar and SUDAAN produce $\mathrm{Deft}^2$ instead of 
Deff. We will use this definition in this paper. 

When the population parameter is the total ($Y$), the 
unbiased estimator is the weighted sample total, namely, 
$\hat{Y} = \sum_s w_k y_k$. When the population mean is the parameter 
of interest, it is usually estimated by the weighted mean, that 
is, $\hat{\bar{Y}} = \sum_s w_k y_k / \sum_s w_k$. It is a special case of the ratio 
estimator, $\sum_s w_k y_k / \sum_s w_k x_k$, where $x_k = 1$ for all $k \in s$. 
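
As a small numerical illustration of these two estimators, the sketch below computes the weighted total and obtains the weighted mean as the ratio estimator with the auxiliary variable fixed at one. The weights and $y$-values are hypothetical, and the function names are chosen only for this example.

```python
def weighted_total(weights, y):
    """Weighted sample total: sum of w_k * y_k over the sample."""
    return sum(w * v for w, v in zip(weights, y))


def ratio_estimator(weights, y, x):
    """Ratio of the weighted total of y to the weighted total of x."""
    return weighted_total(weights, y) / weighted_total(weights, x)


def weighted_mean(weights, y):
    """Weighted mean, i.e. the ratio estimator with x_k = 1 for all k."""
    return ratio_estimator(weights, y, [1.0] * len(y))


# Hypothetical sample of four elements with unequal sampling weights.
w = [10.0, 10.0, 20.0, 40.0]
y = [3.0, 5.0, 4.0, 6.0]

print(weighted_total(w, y))
print(weighted_mean(w, y))
```

The denominator of the weighted mean is itself a random quantity (the sum of the sampled weights), which is one reason the two estimators can behave differently under the same design.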

One common misconception about the design effects for 
$\hat{Y}$ and $\hat{\bar{Y}}$ is that they are similar in values. However, it has 
been observed that the design effect for $\hat{Y}$, $\mathrm{Deft}_p^2(\hat{Y})$, 
tends to be much larger than that for $\hat{\bar{Y}}$, $\mathrm{Deft}_p^2(\hat{\bar{Y}})$. This 
was also noted in, for example, Kish (1987) and Barron and 
Finch (1978). Some explanation can be found in Hansen 
et al. (1953, Vol. I, pages 336-340) who showed that the 
difference arises from the relative variance of the cluster 
sizes. More recently Särndal et al. (1992, pages 315-318) 
showed that contrary to the case of $\hat{\bar{Y}}$, the design effect 
for $\hat{Y}$ depends on the (relative) variation of the $y$-variable. 
In fact, even the design effect for $\hat{\bar{Y}}$ may depend on the 
(relative) variation of the $y$-variable, which we will discuss 
in section 4. This dependence contradicts what the design 
effect is intended to measure as Kish (1995) explicitly 
described: 


“Deft are used to express the effects of sample design beyond the elemental variability ($S_y^2/m$), removing both the units of measurement and sample size as nuisance parameters. With the removal of $S_y$, the units, and the sample size $m$, the design effects on the sampling errors are made generalizable (transferable) to other statistics and to other variables, within the same survey, and even to other surveys.”


His statement may be loosely true for the weighted mean $\hat{\bar{Y}}$ as expressed in the frequently used sample approximate formula for $\mathrm{Deft}^2_p(\hat{\bar{Y}})$ given by Kish (1987):

$$\mathrm{Deft}^2_p(\hat{\bar{Y}}) = \{1 + \rho(\bar{m} - 1)\}(1 + cv_w^2) \qquad (2.2)$$

where the sample design $p$ contains complex features such as unequal weighting and cluster sampling, $\rho = \rho_p(y)$ is the intraclass correlation coefficient (often called within cluster homogeneity measure), $\bar{m}$ is the average cluster sample size, and $cv_w^2$ is the sample relative variance of the weights. Strictly speaking, this formula is not independent of the y-variable because $\rho$ is dependent on the y-variable. Also, the design effect may not be free of the unit of measurement unless $V_p(\hat{\bar{Y}})$ is expressed in a factorial form of $S_y^2/m$. See Park and Lee (2002). This formula (2.2) is valid only when there is no correlation between the sampling weights and the survey variable $y$. However, if the correlation is present, the formula may need to be modified as studied by Spencer (2000) and Park and Lee (2001). In the following section, we elaborate this aspect in detail for two-stage sampling and we will also examine this point further in section 4.1.
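Formula (2.2) is a product of a clustering factor and a weighting factor, which a few lines of code make concrete. This is a minimal sketch; the function name and the input values are hypothetical and chosen only for illustration.

```python
def kish_deft2(rho, mean_cluster_size, cv2_w):
    """Approximate Deft^2 for the weighted mean as in formula (2.2).

    rho               : intraclass correlation coefficient
    mean_cluster_size : average cluster sample size (m-bar)
    cv2_w             : sample relative variance of the weights
    """
    clustering = 1.0 + rho * (mean_cluster_size - 1.0)
    weighting = 1.0 + cv2_w
    return clustering * weighting


# Hypothetical inputs: rho = 0.05, m-bar = 11, cv_w^2 = 0.2.
print(kish_deft2(0.05, 11, 0.2))
```

With equal weights (`cv2_w = 0`) and no within-cluster homogeneity (`rho = 0`) the function returns 1, the srswr benchmark.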


3. DECOMPOSITION OF THE DESIGN EFFECT 
UNDER TWO-STAGE SAMPLING 


We consider a sampling design conducted in two stages. Suppose that a population $U = \{k : k = 1, \ldots, M\}$ with $M$ elements is grouped into $N$ clusters of size $M_i$ such that $M = \sum_{i=1}^{N} M_i$. The first stage sample $s_1 = \{i : i = 1, \ldots, n\}$ of $n$ clusters (primary sampling units, or PSUs in abbreviation) is selected with replacement from $N$ clusters with probabilities $p_i$, where $\sum_{i=1}^{N} p_i = 1$. Let $p_1 = \Pr(s_1)$ denote the first stage sampling design. The second stage sample $s_{2i} = \{j : j = 1, \ldots, m_i\}$ of $m_i$ elements (secondary sampling units or SSUs in abbreviation) is then selected independently from each PSU $i$ selected at the first stage according to some arbitrary sampling design, say $p_{2i} = \Pr(s_{2i} \mid s_1)$ where $i \in s_1$. Denote the total sample of elements and the overall sampling design by $s = \cup_{i \in s_1} s_{2i}$ and $p = \Pr(s)$, respectively. Associated with the $j$th element in the $i$th cluster is a survey characteristic $y_{ij}$, $j = 1, \ldots, M_i$, $i = 1, \ldots, N$. For a given $i \in s_1$, let $w_{j|i}$ be the second stage sampling weights such that an estimator of the form $\hat{Y}_i = \sum_{j=1}^{m_i} w_{j|i} y_{ij}$ is unbiased for the cluster total $Y_i = \sum_{j=1}^{M_i} y_{ij}$, that is, $E_2(\hat{Y}_i) = Y_i$, where $E_2$ represents the expectation with respect to the second stage sampling. Let $w_i = 1/(n p_i)$ be the first stage sampling weights and let $Y = \sum_{i=1}^{N} Y_i$ be the population total. It is easy to show that $E_1(Y_i / p_i) = Y$.


Assuming that $Y_i$ are known for $i \in s_1$, $\sum_{i=1}^{n} w_i Y_i$ is the average of $n$ unbiased estimators of $Y$ so that $E_1(\sum_{i=1}^{n} w_i Y_i) = Y$, where $E_1$ denotes the expectation with respect to the first stage sampling design. Note that both stages are sampling with replacement. Accordingly, it is possible that the same sampling unit (either cluster or element) is selected more than once, but the repeated selections are treated as different units. Define the overall sampling weights by $w_{ij} = w_i w_{j|i}$. Clearly, $\hat{Y} = \sum_{i=1}^{n} \sum_{j=1}^{m_i} w_{ij} y_{ij}$ is unbiased for $Y$ in that $E_p(\hat{Y}) = E_1\{E_2(\hat{Y})\} = E_1(\sum_{i=1}^{n} w_i Y_i) = Y$, where $E_p$ represents the expectation with respect to $p$. The variance of $\hat{Y}$ can be written as


$$V_p(\hat{Y}) = V_1 E_2(\hat{Y}) + E_1 V_2(\hat{Y}) = \sum_{i=1}^{N} w_i (Y_i - p_i Y)^2 + \sum_{i=1}^{N} w_i V_2(\hat{Y}_i) \qquad (3.1)$$

where $V_p$, $V_1$ and $V_2$ represent variances defined with respect to the overall, the first stage, and the second stage sampling. See Särndal et al. (1992, pages 151-152).
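The first-stage term of the variance can be written either in the familiar with-replacement form $n^{-1}\sum_i p_i (Y_i/p_i - Y)^2$ or in the weighted form $\sum_i w_i (Y_i - p_i Y)^2$ with $w_i = 1/(np_i)$. The sketch below checks numerically that the two forms agree and that a single draw $Y_i/p_i$ is unbiased for $Y$; the cluster totals and probabilities are made-up illustrative values.

```python
def first_stage_variance_forms(cluster_totals, probs, n):
    """Return the first-stage variance term computed in two equivalent ways."""
    Y = sum(cluster_totals)
    # Form 1: (1/n) * sum_i p_i * (Y_i / p_i - Y)^2
    form_pps = sum(p * (Yi / p - Y) ** 2
                   for Yi, p in zip(cluster_totals, probs)) / n
    # Form 2: sum_i w_i * (Y_i - p_i * Y)^2 with w_i = 1 / (n * p_i)
    form_weighted = sum((1.0 / (n * p)) * (Yi - p * Y) ** 2
                        for Yi, p in zip(cluster_totals, probs))
    return form_pps, form_weighted


def expected_single_draw(cluster_totals, probs):
    """E_1(Y_i / p_i) for one with-replacement draw."""
    return sum(p * (Yi / p) for Yi, p in zip(cluster_totals, probs))


totals = [10.0, 30.0, 20.0, 40.0]   # hypothetical cluster totals Y_i
probs = [0.1, 0.2, 0.3, 0.4]        # hypothetical selection probabilities p_i
a, b = first_stage_variance_forms(totals, probs, n=5)
print(a, b, expected_single_draw(totals, probs))
```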

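A commonly used estimator for the population mean $\bar{Y} = Y/M$ is the weighted (ratio) estimator given by $\hat{\bar{Y}} = \hat{Y}/\hat{M}$ where $\hat{M} = \sum_{i=1}^{n} \sum_{j=1}^{m_i} w_{ij}$. Using Taylor linearization, as shown in Särndal et al. (1992, pages 176-178), $\hat{\bar{Y}}$ can be approximated as

$$\hat{\bar{Y}} \approx \bar{Y} + M^{-1} \hat{D} \qquad (3.2)$$

where $\hat{D} = \sum_{i=1}^{n} \sum_{j=1}^{m_i} w_{ij} d_{ij}$ is an unbiased estimator of the population total $D$ in terms of $d_{ij} = y_{ij} - \bar{Y}$, which represents the deviation of $y_{ij}$ from the population mean $\bar{Y}$. Note that $D = 0$. Denoting $D_i = \sum_{j=1}^{M_i} d_{ij} = Y_i - M_i \bar{Y}$ and $\hat{D}_i = \sum_{j=1}^{m_i} w_{j|i} d_{ij}$, we obtain the approximate variance of $\hat{\bar{Y}}$ from expression (3.2) as

$$V_p(\hat{\bar{Y}}) \approx M^{-2} \left[ \sum_{i=1}^{N} w_i (Y_i - M_i \bar{Y})^2 + \sum_{i=1}^{N} w_i V_2(\hat{D}_i) \right] \qquad (3.3)$$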


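If a simple random sample of size $m = \sum_{i=1}^{n} m_i$ is selected with replacement from the population $U$, then a sample mean $\bar{y}_{srs} = \sum_s y_k / m$ and its expansion

$$\hat{Y}_{srs} = \frac{1}{f} \sum_s y_k$$

would serve as the estimators of the population mean $\bar{Y}$ and total $Y$, respectively, under srswr, where $f = m/M$ is the overall sampling fraction. Their variances under this sampling design are given as

$$V_{srswr}(\hat{Y}_{srs}) = M^2 V_{srswr}(\bar{y}_{srs}) \qquad (3.4)$$

where $V_{srswr}(\bar{y}_{srs}) = m^{-1} S_y^2$ and $S_y^2 = (M - 1)^{-1} \sum_{k=1}^{M} (y_k - \bar{Y})^2$. We note that $m$ is the achieved sample size, which is a random quantity in general. From (3.1), (3.3), and above expressions with $m$ replaced by its expected value $m'$ with respect to the overall sampling design $p$, i.e., $m' = E_p(m)$, the design effects for $\hat{Y}$ and $\hat{\bar{Y}}$ can be written as

$$\mathrm{Deft}^2_p(\hat{Y}) = \frac{m'}{CV_y^2} \left[ \sum_{i=1}^{N} w_i \left( \frac{Y_i}{Y} - p_i \right)^2 + \sum_{i=1}^{N} w_i \frac{V_2(\hat{Y}_i)}{Y^2} \right] \qquad (3.5)$$

and

$$\mathrm{Deft}^2_p(\hat{\bar{Y}}) \approx \frac{m'}{CV_y^2} \left[ \sum_{i=1}^{N} w_i \left( \frac{Y_i}{Y} - \frac{M_i}{M} \right)^2 + \sum_{i=1}^{N} w_i \frac{V_2(\hat{D}_i)}{Y^2} \right] \qquad (3.6)$$

where $CV_y^2 = S_y^2 / \bar{Y}^2$ represents the population relative variance of the y-variable. From these expressions, the difference in design effects for $\hat{Y}$ and $\hat{\bar{Y}}$ can be written as follows.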


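$$\mathrm{Deft}^2_p(\hat{Y}) - \mathrm{Deft}^2_p(\hat{\bar{Y}}) = \Delta_1 + \Delta_2, \qquad (3.7)$$

where

$$\Delta_1 = \frac{m'}{CV_y^2} \sum_{i=1}^{N} w_i \left[ \left( \frac{Y_i}{Y} - p_i \right)^2 - \left( \frac{Y_i}{Y} - \frac{M_i}{M} \right)^2 \right]$$

and

$$\Delta_2 = \frac{m'}{CV_y^2 \, Y^2} \sum_{i=1}^{N} w_i \left[ V_2(\hat{Y}_i) - V_2(\hat{D}_i) \right].$$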

The two components $\Delta_1$ and $\Delta_2$ in expression (3.7) reflect the differences arising from the respective sources of variation from the first and second stages of sampling. Of course, the second component disappears if all the elements in selected clusters are observed since it becomes a single-stage design or if a simple random sample is selected in the second stage. This is because both variances $V_2(\hat{Y}_i)$ and $V_2(\hat{D}_i)$ are equivalent under the aforementioned conditions, that is, 1) $V_2(\hat{Y}_i) = V_2(\hat{D}_i) = 0$ if $w_{j|i} = 1$ for all $i$ and $j$, and 2) $V_2(\hat{Y}_i) = V_2(\hat{D}_i) \neq 0$ if $w_{j|i} = M_i/m_i$ for all $i$ and $j$. In other words,

$$\Delta_2 = 0 \quad \text{if } w_{j|i} = c_i \text{ for all } i \text{ and } j, \qquad (3.8)$$

where $c_i$ are nonnegative constants and not necessarily equal for different clusters. Meanwhile, we can show that

$$\Delta_1 = \begin{cases} 0 & \text{if } p_i = M_i/M, \\ \Delta_1(y) & \text{if } Y_i/Y = M_i/M, \\ -\Delta_1(y) & \text{if } p_i = Y_i/Y, \end{cases} \qquad (3.9)$$

for all $i$, where $\Delta_1(y) = (m'/CV_y^2) \sum_{i=1}^{N} w_i (p_i - M_i/M)^2$. Note that $\Delta_1(y)$ is a nonnegative quantity and also that the conditions in expression (3.9) can be restated, respectively, as $p_i \propto M_i$, $\bar{Y}_i = \bar{Y}$, and $p_i \propto Y_i$, where $\bar{Y}_i = Y_i/M_i$ for all $i = 1, \ldots, N$. This result reveals the effect of cluster sampling on the precision of the two estimators. For example, if $p_i = M_i/M$, cluster sampling makes no difference in the precision of the two estimators. On the other hand, if $p_i = Y_i/Y$, $\hat{Y}$ becomes more efficient than $\hat{\bar{Y}}$ in precision under cluster sampling, whereas the cluster sampling favors $\hat{\bar{Y}}$ over $\hat{Y}$ in terms of precision if $\bar{Y}_i = \bar{Y}$ for all $i$.
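The three special cases of the first-stage component can be checked numerically. The sketch below is an illustration only: it drops the common positive factor $m'/CV_y^2$, so the functions return the bracketed sums rather than $\Delta_1$ and $\Delta_1(y)$ themselves, and all cluster sizes, totals and probabilities are invented.

```python
def delta1_sum(cluster_totals, cluster_sizes, probs, n):
    """sum_i w_i [ (Y_i/Y - p_i)^2 - (Y_i/Y - M_i/M)^2 ], w_i = 1/(n p_i).

    This is Delta_1 without the positive factor m'/CV_y^2.
    """
    Y = sum(cluster_totals)
    M = sum(cluster_sizes)
    total = 0.0
    for Yi, Mi, p in zip(cluster_totals, cluster_sizes, probs):
        w = 1.0 / (n * p)
        total += w * ((Yi / Y - p) ** 2 - (Yi / Y - Mi / M) ** 2)
    return total


def delta1_y_sum(cluster_sizes, probs, n):
    """sum_i w_i (p_i - M_i/M)^2, i.e. Delta_1(y) without m'/CV_y^2."""
    M = sum(cluster_sizes)
    return sum((1.0 / (n * p)) * (p - Mi / M) ** 2
               for Mi, p in zip(cluster_sizes, probs))


sizes = [2, 4, 6, 8]
totals = [5.0, 9.0, 20.0, 16.0]
n = 3

pps_size = [Mi / sum(sizes) for Mi in sizes]      # p_i = M_i / M
pps_total = [Yi / sum(totals) for Yi in totals]   # p_i = Y_i / Y
print(delta1_sum(totals, sizes, pps_size, n))     # zero case
print(delta1_sum(totals, sizes, pps_total, n),
      -delta1_y_sum(sizes, pps_total, n))         # negative case
```

When the cluster means are all equal (totals proportional to sizes), the same function returns `+delta1_y_sum(...)` for any choice of selection probabilities.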

Now, let us consider some examples of the conditions of 
(3.8) and (3.9). 


Example 3.1 For one or two-stage cluster design with pps cluster sampling using $p_i = M_i/M$ and $w_{j|i} = c_i$ for all $i = 1, \ldots, N$, we have from (3.8) and (3.9) that $\Delta_1 = \Delta_2 = 0$, that is, there is no difference in the design effects for $\hat{Y}$ and $\hat{\bar{Y}}$.

The same result as given in example 3.1 can be achieved by $\hat{Y} = M \hat{\bar{Y}}$. This estimator is the ratio estimator, which can be used if $M$ is known. The case that overall sampling weights are a constant for all the elements (i.e., self-weighting sampling design) is a well known special case. We will come back to this in section 4.


Example 3.2 One-stage simple random cluster sampling or two-stage sample design with srs for both stages. Under these designs, we have $w_{j|i} = c_i$ and $p_i = 1/N$ for all $i$ and $j$ and thus, it follows from (3.8) and (3.9) that $\Delta_2 = 0$ and

$$\Delta_1 = \begin{cases} 0 & \text{if } M_i = M_0 \text{ for all } i, \\ \bar{m}' \, CV_M^2 / CV_y^2 & \text{if } \bar{Y}_i \text{ are all equal}, \\ -\bar{m}' \, CV_M^2 / CV_y^2 & \text{if } Y_i \text{ are all equal}, \end{cases} \qquad (3.10)$$

where $\bar{m}' = m'/n$, $CV_M^2 = \bar{M}^{-2} \sum_{i=1}^{N} (M_i - \bar{M})^2 / N$ denotes the relative variance of cluster sizes $M_i$, and $\bar{M} = M/N$ denotes the average size of clusters. The conditions in (3.10) also satisfy the conditions in (3.9) and therefore, (3.10) is a special case of (3.9). Note that the quantity $\Delta_1(y)$ in expression (3.9) approximately reduces to $\bar{m}' \, CV_M^2 / CV_y^2$ when $p_i = 1/N$ for all $i$.

Example 3.2 shows that when unequal cluster sizes are not reflected in the sampling design, the relative efficiency of $\hat{\bar{Y}}$ over $\hat{Y}$ depends in part on the relative variability of cluster sizes. If the cluster means are all equal, then cluster sampling makes $\hat{\bar{Y}}$ more efficient than $\hat{Y}$, vice versa if all the cluster totals are equal. On the other hand, if all clusters are equal in size, no difference in the design effects arises by simple random sampling of clusters.
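For simple random sampling of clusters, the reduction of $\Delta_1(y)$ to $\bar{m}' CV_M^2 / CV_y^2$ can be verified directly from the definition $\Delta_1(y) = (m'/CV_y^2)\sum_i w_i (p_i - M_i/M)^2$ with $p_i = 1/N$ and $w_i = N/n$. The following sketch uses hypothetical cluster sizes and a hypothetical value of $CV_y^2$; it only illustrates the algebra.

```python
def relvar_cluster_sizes(sizes):
    """CV_M^2 = sum_i (M_i - Mbar)^2 / (N * Mbar^2)."""
    N = len(sizes)
    mean_size = sum(sizes) / N
    return sum((Mi - mean_size) ** 2 for Mi in sizes) / (N * mean_size ** 2)


def delta1_y_srs(sizes, n, mbar_prime, cv2_y):
    """Delta_1(y) from its definition with p_i = 1/N and w_i = N/n."""
    N = len(sizes)
    M = sum(sizes)
    m_prime = n * mbar_prime                      # expected sample size m'
    s = sum((N / n) * (1.0 / N - Mi / M) ** 2 for Mi in sizes)
    return (m_prime / cv2_y) * s


def delta1_y_closed_form(sizes, mbar_prime, cv2_y):
    """The reduced form mbar' * CV_M^2 / CV_y^2."""
    return mbar_prime * relvar_cluster_sizes(sizes) / cv2_y


sizes = [10, 20, 30, 60]    # hypothetical cluster sizes M_i
print(delta1_y_srs(sizes, n=2, mbar_prime=8.0, cv2_y=0.5))
print(delta1_y_closed_form(sizes, mbar_prime=8.0, cv2_y=0.5))
```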

In section 4, we utilize the results derived in this section 
to discuss other examples used in the sampling literature. 


4. EXAMPLES ON THE DESIGN EFFECT IN THE 
SAMPLING LITERATURE 


4.1 Unequal Probability Element Sampling 


Consider an unequal probability element sampling design without clustering. The discussion in section 3 applies to this example with $M_i = 1$ for all $i = 1, \ldots, N$ and thus, $m = n$. For brevity's sake, we use lower cases $y_i$ to denote the value of the y-variable, and we also assume that $N$ is large so that $N/(N-1) \approx 1$. Due to the absence of the second stage sampling variation, the design effects for $\hat{Y}$ and $\hat{\bar{Y}}$ given in (3.5) and (3.6) reduce to

$$\mathrm{Deft}^2_p(\hat{Y}) = \frac{\sum_{i=1}^{N} p_i (y_i/p_i - Y)^2}{N \sum_{i=1}^{N} (y_i - \bar{Y})^2} \qquad (4.1)$$

and

$$\mathrm{Deft}^2_p(\hat{\bar{Y}}) = \frac{\sum_{i=1}^{N} p_i^{-1} (y_i - \bar{Y})^2}{N \sum_{i=1}^{N} (y_i - \bar{Y})^2}. \qquad (4.2)$$

Further let us consider an example where the survey variable $y$ is not correlated with the selection probability $p_i$.


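Example 4.1 Unequal probability element sampling with no correlation between $y_i$ and $p_i$. When $y_i$ and $p_i$ are not correlated, we can approximate $\sum_{i=1}^{N} p_i^{-1} (y_i - \bar{Y})^2$ by $n \bar{W} \sum_{i=1}^{N} (y_i - \bar{Y})^2$, where $\bar{W} = N^{-1} \sum_{i=1}^{N} w_i$. Note that $E_p(n^{-1} \sum_{i=1}^{n} w_i) = N/n$, $E_p(n^{-1} \sum_{i=1}^{n} w_i^2) = N\bar{W}/n$ and $E_p(n^{-1} \sum_{i=1}^{n} w_i^2) / E_p^2(n^{-1} \sum_{i=1}^{n} w_i) = n\bar{W}/N$. Thus,

$$\mathrm{Deft}^2_p(\hat{\bar{Y}}) \approx n\bar{W}/N = \frac{E_p\left(n^{-1} \sum_{i=1}^{n} w_i^2\right)}{E_p^2\left(n^{-1} \sum_{i=1}^{n} w_i\right)}. \qquad (4.3)$$

It is easy to show that $n\bar{W}/N \geq 1$ using the Cauchy-Schwarz inequality (Apostol 1974, page 14). In addition, routine calculations show from (4.1) and (4.2) that

$$\mathrm{Deft}^2_p(\hat{Y}) - \mathrm{Deft}^2_p(\hat{\bar{Y}}) = CV_y^{-2} \left\{ \sum_{i=1}^{N} p_i^{-1} (p_i - \bar{p})^2 - 2 Y^{-1} \sum_{i=1}^{N} p_i^{-1} (y_i - \bar{Y})(p_i - \bar{p}) \right\} \approx CV_y^{-2} (n\bar{W}/N - 1),$$

where $\bar{p} = N^{-1} \sum_{i=1}^{N} p_i = 1/N$. The latter expression is obtained from $\sum_{i=1}^{N} p_i^{-1} (p_i - \bar{p})^2 = n\bar{W}/N - 1$ and $\sum_{i=1}^{N} p_i^{-1} (y_i - \bar{Y})(p_i - \bar{p}) \approx 0$ because $y_i$ and $p_i$ are uncorrelated. Consequently,

$$\mathrm{Deft}^2_p(\hat{Y}) - \mathrm{Deft}^2_p(\hat{\bar{Y}}) = CV_y^{-2} \{ \mathrm{Deft}^2_p(\hat{\bar{Y}}) - 1 \}$$

or

$$\mathrm{Deft}^2_p(\hat{Y}) = (1 + CV_y^{-2}) \, \mathrm{Deft}^2_p(\hat{\bar{Y}}) - CV_y^{-2}. \qquad (4.4)$$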


From (4.4), it is clear that $\mathrm{Deft}^2_p(\hat{Y}) \geq \mathrm{Deft}^2_p(\hat{\bar{Y}})$ if
$\mathrm{Deft}^2_p(\hat{\bar{Y}}) \geq 1$ and the equality holds if $\mathrm{Deft}^2_p(\hat{\bar{Y}}) = 1$ or
W= Nin. Also, Deft? A ix Deft” Lite if A+CV? )< 
Deft? wa ae 

Example 4.1 shows that $\hat{Y}$ tends to have a larger design effect than $\hat{\bar{Y}}$ if the correlation between $y_i$ and $p_i$ is weak and $\mathrm{Deft}^2_p(\hat{\bar{Y}}) > 1$.
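The decomposition of the difference between (4.1) and (4.2) holds exactly for any finite population, before the cross-product term is dropped. The sketch below evaluates both design effects and the decomposition on a small made-up population; it takes $N/(N-1)$ as 1, as in the text, and every numeric value is hypothetical.

```python
def deft2_total(y, p):
    """Expression (4.1): design effect of the weighted total."""
    N = len(y)
    Y = sum(y)
    ybar = Y / N
    num = sum(pi * (yi / pi - Y) ** 2 for yi, pi in zip(y, p))
    return num / (N * sum((yi - ybar) ** 2 for yi in y))


def deft2_mean(y, p):
    """Expression (4.2): design effect of the weighted mean."""
    N = len(y)
    ybar = sum(y) / N
    num = sum((yi - ybar) ** 2 / pi for yi, pi in zip(y, p))
    return num / (N * sum((yi - ybar) ** 2 for yi in y))


def difference_decomposition(y, p):
    """CV_y^-2 * { sum p^-1 (p - pbar)^2 - 2 Y^-1 sum p^-1 (y - ybar)(p - pbar) }."""
    N = len(y)
    Y = sum(y)
    ybar = Y / N
    pbar = 1.0 / N
    cv2_y = (sum((yi - ybar) ** 2 for yi in y) / N) / ybar ** 2
    a = sum((pi - pbar) ** 2 / pi for pi in p)
    b = sum((yi - ybar) * (pi - pbar) / pi for yi, pi in zip(y, p))
    return (a - 2.0 * b / Y) / cv2_y


y = [3.0, 7.0, 4.0, 9.0, 5.0, 6.0]          # hypothetical y-values
p = [0.10, 0.25, 0.15, 0.20, 0.05, 0.25]    # hypothetical selection probabilities
print(deft2_total(y, p) - deft2_mean(y, p))
print(difference_decomposition(y, p))
```

The first sum inside the braces equals $n\bar{W}/N - 1 = N^{-2}\sum_i p_i^{-1} - 1$, which does not depend on $n$.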

The nati quantification of the effect of unequal 
weights on the design efficiency shown in (2.2) is due to 
Kish (1965, 11.7). He considered cases where the unequal 
weights arise from “haphazard” or “random” sources such 
as frame problems or non-response adjustments. Assuming 
that (1) a random sample of size $n$ selected with replacement is divided into $G$ weighting classes such that the same weight $w_g$ is assigned to $n_g$ sampling units within class $g$ and $n = \sum_{g=1}^{G} n_g$, and that (2) all $G$ weighting class variances are equal to the unit variance of $y$, i.e., $S_{yg}^2 = S_y^2$ for all $g = 1, \ldots, G$, he proposed a quantity given as

$$\mathrm{Deft}^2_{Kish}(\hat{\bar{Y}}) = n \sum_{g=1}^{G} n_g w_g^2 \Big/ \left( \sum_{g=1}^{G} n_g w_g \right)^2, \qquad (4.5)$$

to measure the increment in the variance of $\hat{\bar{Y}}$ in comparison with the hypothesized variance under srswr of size $n$. The rationale behind the above derivation is that the loss in precision of $\hat{\bar{Y}}$ due to haphazard unequal weighting can be approximated by the ratio of the variance under disproportionate stratified sampling to that under the proportionate stratified sampling.

In (4.5), letting $n_g = 1$ for all $g$ and thus, $n = G$, Kish (1992) later proposed a well-known approximate formula given as

$$\mathrm{Deft}^2_{Kish}(\hat{\bar{Y}}) = n \sum_{i=1}^{n} w_i^2 \Big/ \left( \sum_{i=1}^{n} w_i \right)^2 = 1 + cv_w^2, \qquad (4.6)$$

where $cv_w^2 = n^{-1} \sum_{i=1}^{n} (w_i - \bar{w})^2 / \bar{w}^2$ is the sample relative variance and $\bar{w}$ is the sample mean of $w_i$. Note that (4.6) is a sample approximate of (4.3). For a sampling design which is inefficient for estimation of $Y$, the inefficiency diminishes with the ratio estimation. Next, we consider the opposite case where the y-variable is correlated with the selection probability $p_i$, where the efficiency of $\hat{Y}$ increases.
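Both forms of Kish's weighting factor are easy to compute from the weights alone. The sketch below implements (4.5) for weighting classes and (4.6) for individual weights and checks that the latter equals one plus the relative variance of the weights; the weights shown are invented for illustration.

```python
def kish_deft2_classes(class_counts, class_weights):
    """Formula (4.5): n * sum(n_g w_g^2) / (sum(n_g w_g))^2."""
    n = sum(class_counts)
    num = n * sum(ng * wg ** 2 for ng, wg in zip(class_counts, class_weights))
    den = sum(ng * wg for ng, wg in zip(class_counts, class_weights)) ** 2
    return num / den


def kish_deft2_weights(weights):
    """Formula (4.6): n * sum(w_i^2) / (sum(w_i))^2."""
    n = len(weights)
    return n * sum(w * w for w in weights) / sum(weights) ** 2


def relvar(weights):
    """Sample relative variance of the weights (divisor n)."""
    n = len(weights)
    wbar = sum(weights) / n
    return sum((w - wbar) ** 2 for w in weights) / (n * wbar ** 2)


weights = [1.0, 1.0, 2.0, 2.0, 2.0, 4.0]            # hypothetical unit weights
print(kish_deft2_weights(weights))                   # 1.25
print(1.0 + relvar(weights))                         # 1.25
print(kish_deft2_classes([2, 3, 1], [1.0, 2.0, 4.0]))  # same weights, grouped
```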


Example 4.2 Unequal probability element sampling where $y_i$ is correlated with $p_i$. Suppose that $y_i$ is linearly related with $p_i$ by $y_i = A + B p_i + e_i$, where $A$ and $B$ are the least-square regression coefficients of the model for the (finite) population and $e_i$ is the corresponding residual. Furthermore, assume that the regression model fits well to the population data and the error variance is roughly homogeneous so that $R_{ew} \approx 0$ and $R_{e^2 w} \approx 0$, where $R_{ew}$ and $R_{e^2 w}$ denote the population correlations of pairs $(e_i, w_i)$ and $(e_i^2, w_i)$, respectively. For example, $R_{ew} = \sum_{i=1}^{N} (e_i - \bar{E})(w_i - \bar{W}) / \{(N-1) S_e S_w\}$, where $\bar{E} = N^{-1} \sum_{i=1}^{N} e_i$, and $S_e$ and $S_w$ are the population standard deviations of $e_i$ and $w_i$, respectively. Then the design effects given by (4.1) and (4.2) reduce to


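$$\mathrm{Deft}^2_p(\hat{Y}) \approx (n\bar{W}/N)(1 - R_{yp}^2) + (n\bar{W}/N - 1) \left( 1 - R_{yp} \frac{CV_y}{CV_p} \right)^2 \Big/ CV_y^2 \qquad (4.7)$$

and

$$\mathrm{Deft}^2_p(\hat{\bar{Y}}) \approx (n\bar{W}/N)(1 - R_{yp}^2) + (n\bar{W}/N - 1) \frac{R_{yp}^2}{CV_p^2}, \qquad (4.8)$$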

respectively, where $R_{yp}$ is the population correlation between $y_i$ and $p_i$ and $CV_p$ is the population coefficient of variation of $p_i$ (see Park and Lee (2001) for proof). It follows from (4.7) and (4.8) that $\mathrm{Deft}^2_p(\hat{Y}) \geq \mathrm{Deft}^2_p(\hat{\bar{Y}})$ if and only if

$$2 R_{yp} \leq CV_p / CV_y, \qquad (4.9)$$

where the equality holds if and only if $2 R_{yp} = CV_p / CV_y$. Also, the inequality is reversed when the inequality in (4.9) becomes opposite.

The condition (4.9) indicates that $\hat{Y}$ tends to be less efficient in terms of precision than $\hat{\bar{Y}}$ whenever $R_{yp}$ is small. Thus, we see that $R_{yp}$ plays an important role in determining the design efficiency of unequal probability sampling on $\hat{Y}$ and $\hat{\bar{Y}}$ and their relative efficiency.

In an attempt to develop an approximate expression to the design effect when $y_i$ is correlated with $p_i$, Spencer (2000) proposed a sample approximate formula for $\hat{Y}$ and compared it with Kish's approximate formula (4.6) for the special case of $R_{yp} = 0$. As seen in example 4.2, the two design effects (4.7) and (4.8) are not equal unless $\bar{W} = N/n$ (see Park and Lee (2001) for more discussion and some numerical examples). In addition, this special case provides the same condition as for example 4.1 and thus, the two approximate design effect formulae (4.7) and (4.8) are equivalent to (4.4) and (4.3), respectively.


4.2 One-Stage Cluster Sampling 


Consider a one-stage cluster sampling, where every element in a sampled cluster is included in the sample, i.e., $m_i = M_i$ for all $i \in s_1$. Due to the absence of the second stage sampling variation, the variance of $\hat{Y}$ takes only the first term of expression (3.1) and it can be decomposed as


aan 1) 


> mY pas HD Si OTE (4.10) 


where S*,=(N-1)'>™,M,(¥,-Y)? and Q,= 
MeiM=pM) tor v=, ..,N-- Noe thar 0,0 "ir 
p; =M,/M, that is, p, is proportional to the cluster size 
M,. Also, note that S*, is the between-cluster mean square 
deviation in an analysis of variance. Denoting the within- 
cluster mean square deviation as S\y 2 =(M-N) yee 
pape ys Y.)’, write Sos a+3(M — N)KN—-1)} 
vue 5=1-S*,,/S°. Since the expected sample size is 

=nM, the sihieh effect for Y can be written from 
Lie as 


Dette 


(4.11) 


(4.12) 


We observe that the design effect for $\hat{\bar{Y}}$ differs from that for $\hat{Y}$ in the second term containing $D_i = \sum_{j=1}^{M_i} (y_{ij} - \bar{Y})$ instead of $Y_i$. In addition, we note that the quantity $\delta = \delta_p(y)$ is the adjusted coefficient of determination ($R^2_{adj}$) in the regression analysis context. It may be called a homogeneity measure. For more discussion on $\delta$, see Särndal et al. (1992, pages 130-131) and Lohr (1999, page 140).


Example 4.3 One-stage simple random sampling of 
clusters. In this example, if p, =1/N for all 1=1,...,N, 
the two design effects in (4.11) and (4.12) reduce, 
respectively, to 


Deft* (Y) (>) [1+ 


ad 
N-1 


N-CV? 
and 
Dettege ps Saha is 
P N cer 
N M.\D.\ 
Seats Mi | 2), 414) 
ae e M \Y 
where M=M/N. _ Since Deft’, (Y )— Deft’, Yee 


ee Me, —M )QY,= Y), the eal between bc 
effects for Y and Y depends on the joint distribution of Yi 
and M,. 


Example 4.4 One-stage simple random sampling of clusters of equal-size. In this case, we have $M_i = M_0$ and $p_i = 1/N$ for all $i = 1, \ldots, N$ and both design effects in (4.13) and (4.14) can be approximated by the same quantity given as

$$\frac{N-1}{N} \left[ 1 + \frac{N(M_0 - 1)}{N - 1} \delta \right]$$

since $M_i - \bar{M} = 0$ for all $i = 1, \ldots, N$.
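For equal-size clusters selected by simple random sampling with replacement, the design effect of the weighted total can be computed directly as the ratio of the first term of (3.1) to the srswr variance $M^2 S_y^2 / m'$ with $m' = nM_0$, and compared with the quantity above. The sketch below does this on a small invented population, taking $\delta = 1 - S_{yW}^2 / S_y^2$ as defined in section 4.2.

```python
def deft2_total_one_stage_srs(clusters, n):
    """Direct ratio V_p(Y-hat) / V_srswr for equal-size clusters, p_i = 1/N."""
    N = len(clusters)
    M0 = len(clusters[0])
    M = N * M0
    totals = [sum(c) for c in clusters]
    Y = sum(totals)
    # First term of (3.1) with w_i = 1/(n p_i) = N/n and p_i Y = Y/N.
    v_design = sum((N / n) * (Yi - Y / N) ** 2 for Yi in totals)
    ybar = Y / M
    s2_y = sum((y - ybar) ** 2 for c in clusters for y in c) / (M - 1)
    v_srswr = M ** 2 * s2_y / (n * M0)   # expected sample size m' = n * M0
    return v_design / v_srswr


def homogeneity_delta(clusters):
    """delta = 1 - S_yW^2 / S_y^2."""
    N = len(clusters)
    M0 = len(clusters[0])
    M = N * M0
    ybar = sum(sum(c) for c in clusters) / M
    s2_y = sum((y - ybar) ** 2 for c in clusters for y in c) / (M - 1)
    s2_w = sum((y - sum(c) / M0) ** 2 for c in clusters for y in c) / (M - N)
    return 1.0 - s2_w / s2_y


def deft2_closed_form(clusters):
    """((N-1)/N) * [1 + N (M0 - 1) delta / (N - 1)]."""
    N = len(clusters)
    M0 = len(clusters[0])
    return ((N - 1) / N) * (1.0 + N * (M0 - 1) * homogeneity_delta(clusters) / (N - 1))


clusters = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [5.0, 5.0, 8.0], [0.0, 1.0, 1.0]]
print(deft2_total_one_stage_srs(clusters, n=2), deft2_closed_form(clusters))
```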

To introduce the clustering effect on variance estimation, 
one often uses the simplest form of one-stage simple 
random cluster sampling as in example 4.4. For example, 
see Cochran (1977, section 9.4), Lehtonen and Pahkinen 
(1995, page 91), and Lohr (1999, section 5.2.2). Although 
these authors adopted a without-replacement sampling 
scheme, we compare their formulae with our formulae under
the with-replacement sampling assumption for the sake of
both simplicity and consistency. Furthermore, the comparison
is valid because their formulae are defined with the
finite population correction incorporated in both numerator 
and denominator so that its effect is basically cancelled out. 
Cochran (1977, section 9.4) derived 


(4.15a) 


=] 
Nop Ut Mo -Del 
5 


Deft? (Y) = 
(4.15b) 


where p is called the intracluster correlation coefficient 
defined by 


$$\rho = \frac{2 \sum_{i=1}^{N} \sum_{j > k = 1}^{M_0} (y_{ij} - \bar{Y})(y_{ik} - \bar{Y})}{(M_0 - 1) \sum_{i=1}^{N} \sum_{j=1}^{M_0} (y_{ij} - \bar{Y})^2}. \qquad (4.15c)$$
Rewriting $\sum_{i=1}^{N} [\sum_{j=1}^{M_0} (y_{ij} - \bar{Y})]^2 = M_0 (N-1) S_{yB}^2$ and $\sum_{i=1}^{N} \sum_{j=1}^{M_0} (y_{ij} - \bar{Y})^2 = (N M_0 - 1) S_y^2 = (N-1) S_{yB}^2 + N(M_0 - 1) S_{yW}^2$, it is easy to show that

$$2 \sum_{i=1}^{N} \sum_{j > k = 1}^{M_0} (y_{ij} - \bar{Y})(y_{ik} - \bar{Y}) = \sum_{i=1}^{N} \left[ \sum_{j=1}^{M_0} (y_{ij} - \bar{Y}) \right]^2 - \sum_{i=1}^{N} \sum_{j=1}^{M_0} (y_{ij} - \bar{Y})^2 = (M_0 - 1)(N M_0 - 1) S_y^2 - N M_0 (M_0 - 1) S_{yW}^2$$

and, thus, from (4.15c), $\rho = 1 - \{N M_0 / (N M_0 - 1)\}(S_{yW}^2 / S_y^2) \approx \delta$ with $M_i = M_0$ for all $i = 1, \ldots, N$, assuming $N M_0 / (N M_0 - 1) \approx 1$. Therefore, further assuming $(N-1)/N \approx 1$ and $(N M_0 - 1) M_0^{-1} (N-1)^{-1} \approx 1$, both design effect formulae (4.15a) and (4.15b) are approximately equivalent to $1 + (M_0 - 1)\delta$. Other authors arrived at the same approximate formula. This is because $\delta$ and $\rho$ essentially measure the same thing, which is the cluster homogeneity. Under this situation, two estimators $\hat{Y}$ and $\hat{\bar{Y}}$ have the same design effect as discussed in example 3.2. Note that this is a simple case of a self-weighting sampling design.
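The closeness of the intracluster correlation $\rho$ in (4.15c) and the homogeneity measure $\delta = 1 - S_{yW}^2 / S_y^2$ can be seen numerically. The sketch below computes $\rho$ from the pairwise cross-products and checks the exact relation $\rho = 1 - \{NM_0/(NM_0-1)\}(1 - \delta)$ for equal-size clusters; the data are invented.

```python
def intracluster_rho(clusters):
    """rho from (4.15c), using explicit pairwise cross-products (j > k)."""
    M0 = len(clusters[0])
    values = [y for c in clusters for y in c]
    ybar = sum(values) / len(values)
    ss_total = sum((y - ybar) ** 2 for y in values)
    cross = 0.0
    for c in clusters:
        for j in range(M0):
            for k in range(j):
                cross += (c[j] - ybar) * (c[k] - ybar)
    return 2.0 * cross / ((M0 - 1) * ss_total)


def delta_measure(clusters):
    """delta = 1 - S_yW^2 / S_y^2 for equal-size clusters."""
    N = len(clusters)
    M0 = len(clusters[0])
    values = [y for c in clusters for y in c]
    ybar = sum(values) / len(values)
    s2_y = sum((y - ybar) ** 2 for y in values) / (N * M0 - 1)
    s2_w = sum((y - sum(c) / M0) ** 2 for c in clusters for y in c) / (N * (M0 - 1))
    return 1.0 - s2_w / s2_y


clusters = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [5.0, 5.0, 8.0], [0.0, 1.0, 1.0]]
print(intracluster_rho(clusters), delta_measure(clusters))
```

The two measures differ only through the factor $NM_0/(NM_0-1)$, so the gap shrinks as the population grows.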

Särndal et al. (1992, section 8.7) compared the design
effects for the two estimators under the setting of example
4.3. They also derived a simplified expression $1 + (\bar{M} - 1)\delta$
for (4.13) and (4.14), assuming the covariances of M, with 
M,Y,° and M, D, are ignorable. Their discussion on the 
difference between total and mean estimators boils down to
$\Delta_1$ in example 3.2. They also noted that the design effect
can be much more severe for the population total than for 
the population mean because more is lost through sampling 
of clusters when the total is estimated than when the mean is 
estimated. 

A common practice to handle unequal cluster sizes is to 
use a more efficient sampling method that incorporates the 
size difference such as pps sampling of clusters. Expressions 
(4.11) and (4.12) can be applied to arbitrary selection probabilities $p_i$, where $p_i$ are set to be proportional to some size measures $Z_i \geq 0$. The difference between the design effects for $\hat{Y}$ and $\hat{\bar{Y}}$ is explained by $\Delta_1$ in (3.9), or alternatively
m & w.d.\(Y. ei bhls 
Ave: YS Ss] -]/H] i 416) 
CVe ia, E> Mak rs 


The term $Q_i$ in (4.16) represents the effect of $p_i$ on the variance estimation when size measures other than the actual cluster sizes $M_i$ are used. Thomsen, Tesfu, and Binder (1986) considered the effect of an out-dated size measure among other factors under two-stage sampling with a simple random sample of elements at the second stage. We will come back to this in section 4.4.


4.3 Self-Weighting Designs 


In a self-weighting sample, every sample element has the same weight. This leads to simple forms for both total and mean estimators. They are given by $\hat{Y} = y/f$ and $\hat{\bar{Y}} = y/m$, where $f = m/M$ is the overall sampling fraction and $y = \sum_{i=1}^{n} \sum_{j=1}^{m_i} y_{ij}$ is the sample total. Then just like simple random sampling as shown in (3.4), the two estimators have the same design effect.

A self-weighting sampling design can be implemented in 
various ways by synchronizing the first stage sampling 
method with the second stage sampling method (e.g., Kish 
1965, section 7.2). For example, if equal probability 
sampling is used for the first stage sampling, then the 
second stage should be sampled by an equal probability 
sampling method with a uniform sampling fraction for all 
PSUs. As a special case of this, where an srs of PSUs of equal size (i.e., $M_i = M_0$ for all $i$) is selected, Hansen et al. (1953, Vol. I, pages 162-163) showed

$$CV_p^2(\hat{\bar{Y}}) = \frac{1}{m} CV_y^2 \, [1 + \rho(\bar{m} - 1)], \qquad (4.17)$$

where $CV_p^2(\hat{\bar{Y}}) = V_p(\hat{\bar{Y}}) / \bar{Y}^2$ is the relative variance of $\hat{\bar{Y}}$ under the sampling design $p$ and $\rho$ is the intracluster correlation coefficient as defined in (4.15c). Since the relative variance of $\bar{y}_{srs}$ under srswr is $m^{-1} CV_y^2$, the well known approximate design effect formula for $\hat{\bar{Y}}$ under a self-weighting design follows immediately as

$$\mathrm{Deft}^2_p(\hat{\bar{Y}}) = 1 + \rho(\bar{m} - 1). \qquad (4.18)$$

For one-stage cluster designs, we showed similar forms given in (4.15a) and (4.15b) (see also Yamane 1967, section 8.7). Hansen et al. (1953, Vol. II, page 204) further showed $CV_p^2(\hat{Y}) = CV_p^2(\hat{\bar{Y}})$ for a sample design that employs simple random sampling at both stages. This implies that $\hat{Y}$ and $\hat{\bar{Y}}$ have the same design effect.


4.4 Two-Stage Unequal Probability Sampling 
Let us first consider the following example. 


Example 4.5 A two-stage sampling design where $n$ PSUs are selected with replacement with probability $p_i$ and an equal size simple random sample of $m_0 \geq 2$ elements is selected with replacement from each selected PSU. With routine calculations and simplification, we can show that


Deft’, (Y) =1+(my -1)t+W,, (4.19) 
where

$$\tau = \frac{(N-1) S_{yB}^2 + \sum_{i=1}^{N} (m_0 - 1)^{-1} S_{yi}^2}{(N-1) S_{yB}^2 + \sum_{i=1}^{N} (M_i - 1) S_{yi}^2}, \qquad (4.20)$$


Si =(M, —-) Ag shy Wi = W, Var aalc 
(m, /CV? yo) (C1 pM - eee ae (1+ CV, /m,), and 
$CV_{yi}^2 = S_{yi}^2 / \bar{Y}_i^2$ denotes the within-cluster relative variance
of the y-variable. Similarly,


Deft? (Y)=1+(m, -)t+Wy, (4.21) 


where W, =W,/V...,, CNR = (my {Ev ee (OulpeM *) 
(D,/Y)°U+CVj,/m,), and D, and cv: are defined 
with the transformed variable d (d; =y, —Y) analogously 
to Y, and even respectively. (Detailed derivations of 
expressions (4.19) and (4.21) are available from the 
authors.) For the case with m; =m, for all 7, the difference 
in the design effects given in (4.19) and (4.21) reduces to 
(3.7) or (4.16). There is no contribution from the second 
stage sampling to the difference. 

Coming back to Thomsen et al. (1986) who studied the effect of using an outdated measure of size on the variance, the above discussion on $\hat{Y}$ parallels their discussion. The only difference is that they assumed a without-replacement sampling scheme at the second stage. Note, however, that the definition of $\tau$ in Thomsen et al. (1986) is slightly different from (4.20) and from $\delta$ in section 4.2. However, there is a close connection between them. To see this, let us write $\tau$ as a function of some quantities $b_i$ associated with PSUs as follows:

$$\tau(b_i) = \frac{(N-1) S_{yB}^2 - \sum_{i=1}^{N} b_i S_{yi}^2}{(N-1) S_{yB}^2 + \sum_{i=1}^{N} (M_i - 1) S_{yi}^2}.$$

Then the $\tau$ in Thomsen et al. (1986) is obtained with $b_i = 1$, the $\tau$ in example 4.5 with $b_i = -1/(m_0 - 1)$, and $\delta$ in section 4.2 with $b_i = (M_i - 1) / \{\sum_{i=1}^{N} (M_i - 1)/(N-1)\}$. Equating Kish's formula (4.18) for $\hat{\bar{Y}}$ to (4.19) for $\hat{Y}$, they obviously overlooked that the design effects for $\hat{Y}$ and $\hat{\bar{Y}}$ can be very different.

For more general cases, Kish (1987) proposed the following popular formula for $\hat{\bar{Y}}$:

$$\mathrm{Deft}^2_{Kish}(\hat{\bar{Y}}) = \frac{n \sum_{g=1}^{G} n_g w_g^2}{\left( \sum_{g=1}^{G} n_g w_g \right)^2} \, [1 + \rho(\bar{m} - 1)] = (1 + cv_w^2) \, [1 + \rho(\bar{m} - 1)].$$

This was obtained by applying (4.5) (or (4.6)) and (4.18) recursively to incorporate the effects of both clustering and unequal weights. Gabler, Haeder and Lahiri (1999) justified the above formula for $\hat{\bar{Y}}$ using a superpopulation model defined for the cross-classification of $N$ clusters and $G$ weighting classes. However, the difference between the design effects for $\hat{Y}$ and $\hat{\bar{Y}}$ cannot be exposed by such a model-based approach, since $y_i$ is treated as a random variable while $w_i$ as fixed. Under this approach, $\mathrm{Deft}^2_p(\hat{Y})$ differs from $\mathrm{Deft}^2_p(\hat{\bar{Y}})$ only by a factor of $(\hat{M}/M)^2$, although the actual difference can be much more pronounced as we have shown in this paper (e.g., expressions (3.7) and (4.23)).


4.5 More General Cases 


Weighting survey data involves not only sampling 
weights but also various weighting adjustments such as 
post-stratification, raking, and nonresponse compensation. 
We consider these general cases here. 

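We can rewrite the first-order Taylor approximation to the weighted mean estimator $\hat{\bar{Y}} = \hat{Y}/\hat{M}$ given in (3.2) as $\hat{\bar{Y}} - \bar{Y} \approx (\hat{Y} - \bar{Y}\hat{M})/M$, or equivalently $\hat{Y} - Y \approx M(\hat{\bar{Y}} - \bar{Y}) + \bar{Y}(\hat{M} - M)$. Taking variance on both sides,

$$CV_p^2(\hat{Y}) \approx CV_p^2(\hat{\bar{Y}}) + CV_p^2(\hat{M}) + 2 R_p(\hat{\bar{Y}}, \hat{M}) \, CV_p(\hat{\bar{Y}}) \, CV_p(\hat{M}), \qquad (4.22)$$

where $CV_p^2(\hat{Y})$, $CV_p^2(\hat{\bar{Y}})$, $CV_p^2(\hat{M})$ are the relative variances of $\hat{Y}$, $\hat{\bar{Y}}$, and $\hat{M}$ respectively, and $R_p(\hat{\bar{Y}}, \hat{M})$ is the correlation coefficient of $\hat{\bar{Y}}$ and $\hat{M}$ with respect to the complex sampling design $p$ and any weighting adjustments. Since the relative variances of simple sample total and mean are equal, $CV_{srswr}^2(\hat{Y}_{srs}) = CV_{srswr}^2(\bar{y}_{srs}) = CV_y^2/m$, under srswr of size $m$, it follows from (4.22) that

$$\mathrm{Deft}^2_p(\hat{Y}) \approx \mathrm{Deft}^2_p(\hat{\bar{Y}}) + 2 R_p(\hat{\bar{Y}}, \hat{M}) \, \nabla_p(y) \, \mathrm{Deft}_p(\hat{\bar{Y}}) + \nabla_p^2(y), \qquad (4.23)$$

where $\nabla_p(y) = CV_p(\hat{M}) / CV_{srswr}(\bar{y}_{srs})$ is nonnegative.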
As an illustration, consider a binary variable $y$, where $CV_y^2 = (1 - \bar{Y})/\bar{Y}$ and, thus, $\nabla_p(y)$ can be arbitrarily large as $\bar{Y}$ approaches 1 or small as $\bar{Y}$ approaches zero assuming $CV_p(\hat{M}) \neq 0$. When $\nabla_p(y)$ is near zero, the two design effects are nearly equal. Otherwise, one is larger than the other depending on the values of $\nabla_p(y)$ and $R_p(\hat{\bar{Y}}, \hat{M})$. When the sampling weights are benchmarked to the known population size $M$, $\hat{Y}$ and $\hat{\bar{Y}}$ have the same design effect since $\hat{M} = M$ and $CV_p(\hat{M}) = 0$. In this case, $\hat{\bar{Y}}$ is not affected by the benchmarking but $\hat{Y} = M\hat{\bar{Y}}$, which is a ratio estimator. Note that poststratification or raking procedures may be used if population size information is available at subpopulation level and we also get equivalent design effects. In general, however, we have $\mathrm{Deft}^2_p(\hat{Y}) \geq \mathrm{Deft}^2_p(\hat{\bar{Y}})$ if

$$R_p(\hat{\bar{Y}}, \hat{M}) \geq -\frac{\nabla_p(y)}{2 \, \mathrm{Deft}_p(\hat{\bar{Y}})} \quad \text{or} \quad R_p(\hat{\bar{Y}}, \hat{M}) \geq -\frac{CV_p(\hat{M})}{2 \, CV_p(\hat{\bar{Y}})} \qquad (4.24)$$

and vice versa.
It is illuminating to look at some special situations. For example, if $R_p(\hat{\bar{Y}}, \hat{M}) \geq 0$, then $\mathrm{Deft}^2_p(\hat{Y}) \geq \mathrm{Deft}^2_p(\hat{\bar{Y}})$; however, a negative correlation (i.e., $R_p(\hat{\bar{Y}}, \hat{M}) < 0$) doesn't necessarily lead to $\mathrm{Deft}^2_p(\hat{Y}) < \mathrm{Deft}^2_p(\hat{\bar{Y}})$. For a special case of $R_p(\hat{\bar{Y}}, \hat{M}) = 0$, the difference is given by

$$\mathrm{Deft}^2_p(\hat{Y}) - \mathrm{Deft}^2_p(\hat{\bar{Y}}) = \frac{CV_p^2(\hat{M})}{CV_{srswr}^2(\bar{y}_{srs})}. \qquad (4.25)$$

Figure 1 shows graphically the relation between the two design effects. The expression in (4.23) is plotted for some fixed values of $R_p(\hat{\bar{Y}}, \hat{M})$ and $\nabla_p(y)$. The solid line passing through the origin which represents equal design effects is the reference line. As the graphs show, the relationship is not clear-cut. When $R_p(\hat{\bar{Y}}, \hat{M}) < 0$, $\mathrm{Deft}^2_p(\hat{Y}) > \mathrm{Deft}^2_p(\hat{\bar{Y}})$ for small $\mathrm{Deft}^2_p(\hat{\bar{Y}})$ but the relation flips over as $\mathrm{Deft}^2_p(\hat{\bar{Y}})$ grows larger.

Hansen et al. (1953, Vol. I, pages 338-339) indicated that $R_p(\hat{\bar{Y}}, \hat{M})$ would often be close to 0. Under this situation, expression (4.25) is also written as $\mathrm{Deft}^2_p(\hat{Y}) = \mathrm{Deft}^2_p(\hat{\bar{Y}}) [1 + CV_p^2(\hat{M}) / CV_p^2(\hat{\bar{Y}})]$, from which we get $\mathrm{Deft}^2_p(\hat{Y}) \geq \mathrm{Deft}^2_p(\hat{\bar{Y}})$. This special case was studied by Jang (2001). However, this doesn't seem necessary as can be seen in the following example.


Example 4.6 To illustrate the relationship between the design effects for $\hat{Y}$ and $\hat{\bar{Y}}$, we used a data set for the adults collected from the U.S. Third National Health and Nutrition Examination Survey (NHANES III), which is
given as a demo file in WesVar version 4.0. NHANES III is 
a nationwide large-scale medical examination survey based 
on a stratified multistage sampling design, for which Fay's modified balanced repeated replication (BRR) method was employed for variance estimation. (See Judkins 1990 for more details on Fay's method.) We used only 19,793 records with complete responses to those characteristics listed in Table 1. Note that the weight in the demo file is different from the NHANES III final weight that was obtained by poststratification. For more detailed information on the demo file, see Westat (2001).

Table 1 presents the design effects for $\hat{Y}$ and $\hat{\bar{Y}}$, and component terms of (4.23) for the selected characteristics. Note that $\nabla_p(y)$ monotonically decreases in $CV_y$ given that $m = 19{,}793$ and $cv_p(\hat{M}) = 3.2\%$. Although $\nabla_p(y)$ tends to be the determinant factor in the difference of the design effects, $R_p(\hat{\bar{Y}}, \hat{M})$ can be important when it is negative. For example, for two race/ethnicity characteristics, African American and Hispanic, the negative values, -0.67 and -0.24 of $R_p(\hat{\bar{Y}}, \hat{M})$ were responsible for $\mathrm{Deft}^2_p(\hat{Y}) < \mathrm{Deft}^2_p(\hat{\bar{Y}})$. Some design effects for $\hat{Y}$ are huge. This is not the case with the NHANES III poststratified final weights, with which $\hat{Y}$ and $\hat{\bar{Y}}$ have the same design effect. This illustrates the importance of benchmarking weight adjustments for total estimates.
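Expression (4.23) can be checked against the published entries of Table 1. The sketch below, with hypothetical function names, recomputes the design effect of the total from the design effect of the mean, the correlation $R_p(\hat{\bar{Y}}, \hat{M})$ and $\nabla_p(y)$, and evaluates the first condition in (4.24). The inputs are the rounded values printed in Table 1, so agreement with the published totals is only approximate.

```python
import math


def deft2_total_from_mean(deft2_mean, r, nabla):
    """Right-hand side of (4.23)."""
    return deft2_mean + 2.0 * r * nabla * math.sqrt(deft2_mean) + nabla ** 2


def total_at_least_mean(deft2_mean, r, nabla):
    """First condition in (4.24): R >= -nabla / (2 Deft(mean))."""
    return r >= -nabla / (2.0 * math.sqrt(deft2_mean))


# Rounded Table 1 entries: (Deft^2 for mean, R, nabla, published Deft^2 for total)
rows = {
    "smoking, Yes": (4.13, 0.20, 4.83, 31.31),
    "African American": (7.64, -0.67, 1.65, 4.21),
}
for label, (d2_mean, r, nabla, published) in rows.items():
    approx = deft2_total_from_mean(d2_mean, r, nabla)
    print(label, round(approx, 2), published, total_at_least_mean(d2_mean, r, nabla))
```

The second row shows how a sufficiently negative correlation pushes the design effect of the total below that of the mean.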


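Figure 1. Plots of $\mathrm{Deft}^2_p(\hat{Y})$ versus $\mathrm{Deft}^2_p(\hat{\bar{Y}})$ (axes: design effect for total against design effect for mean) for (a) $\nabla_p(y) = 1.0$ and (b) $\nabla_p(y) = 2.5$. The solid line corresponds to $\mathrm{Deft}^2_p(\hat{Y}) = \mathrm{Deft}^2_p(\hat{\bar{Y}})$. The other curves correspond to fixed values of $R_p(\hat{\bar{Y}}, \hat{M})$ (..., -0.5, -0.2, 0, 0.2, 0.5, 0.9).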
Table 1
Comparison of the design effects for the weighted total and mean using a subset of the adult data file from the U.S. Third National Health and Nutrition Examination Survey (NHANES III)

Mean

Characteristic                        Category            Estimate   Deft^2        cv_p
[...] smoke [...] cigarettes in life? Yes                 0.53       4.13          0.014
Has diabetes?                         Yes                 0.05       [illegible]   0.040
                                      No                  0.95       [illegible]   0.002
Has hypertension/                     Yes                 0.23       3.42          0.024
high blood pressure?                  No                  0.77       3.42          0.007
Race/Ethnicity                        African American*   0.12       7.64          0.054
                                      Hispanic*           0.05       6.70          0.079
Gender                                Male                0.48       1.40          0.009
                                      Female              0.52       1.40          0.008
Number of cigarettes smoked per day   -                   5.25       6.42          0.037
Population Size                       -                   -          -             -

Total

Columns: Estimate; Deft^2; cv_p (of the total); CV_y; R_p (correlation of the weighted mean and the estimated population size); nabla_p(y); -cv_p(M-hat)/{2 cv_p(mean)}

Characteristic                        Category            Estimate      Deft^2   cv_p    CV_y    R_p     nabla_p(y)   -cv_p(M-hat)/{2 cv_p(mean)}
[...] smoke [...] cigarettes in life? Yes                 98,397,795    31.31    0.038   0.944    0.20    4.83        -0.58
Has diabetes?                         Yes                  9,783,307     1.92    0.042   4.246   -0.34    1.07        -0.31
                                      No                 176,341,218   393.47    0.033   0.236    0.34   19.35        -5.53
Has hypertension/                     Yes                 42,939,866     7.96    0.037   1.826   -0.18    2.50        -0.37
high blood pressure?                  No                 143,184,660    78.44    0.034   0.548    0.18    8.32        [illegible]
Race/Ethnicity                        African American*   21,567,028     4.21    0.040   2.762   -0.67    1.65        -0.11
                                      Hispanic*            9,550,326     6.48    0.078   4.300   -0.24    1.06        -0.08
Gender                                Male                88,725,967    19.18    0.033   1.048   -0.11    4.35        -1.55
                                      Female              97,398,559    25.39    0.034   0.954    0.11    4.77        -1.70
Number of cigarettes smoked per day   -                  977,225,826    10.51    0.047   2.044   -0.09    2.23        -0.17
Population Size                       -                  186,124,526    -        0.032   -        -       -           -

Note: * denotes the cases where the design effect for $\hat{Y}$ is smaller than that for $\hat{\bar{Y}}$.


5. CONCLUSION 


We studied the design effects of the two most widely
used estimators for the population mean and total in sample
surveys under various with-replacement sampling schemes.
We do not think that restricting attention to with-replacement
sampling is a serious limitation, because it lets us see the
results clearly without cluttering the mathematics with the
complications of without-replacement sampling schemes, which
are probably unnecessary here. Furthermore, the effect of the finite
population correction is largely canceled out in our
formulation of the design effect, so the results are quite
comparable with traditional design effects for
without-replacement sampling. Therefore, our findings should be
useful in practice. We summarize our key findings below.

Kish's well-known approximate formulae for the design
effect for (ratio type) weighted mean estimators are not
easily generalized, in form or in concept, to more
general problems, especially weighted total estimators,
contrary to what many people would expect. In fact, $\hat{Y}$
and $\hat{\bar{Y}}$ often have very different design effects unless the
sampling design is self-weighting or the sampling weights
are benchmarked to the known population size. In addition,
the design effect is in general not free from the distribution
of the study variable even for the mean estimator, let alone
the total estimator. Furthermore, the correlation of the study
variable with the weights used in estimation can be an
important factor in determining the design effect. Therefore,
beyond its original purpose, the design effect measures
not only the effect of a complex sampling design on a
particular statistic but also the effects of the distribution of
the study variable and its relations to the sampling design on
the statistic. As complex survey software packages routinely
produce the design effect, it seems appropriate to warn
users of these packages about these rather obscure facts about the
design effect.
Robust Generalized Regression Estimation


JEAN-FRANÇOIS BEAUMONT and ASMA ALAVI¹


ABSTRACT 


The Best Linear Unbiased (BLU) estimator (or predictor) of a population total is based on the following two assumptions: i)
the estimation model underlying the BLU estimator is correctly specified and ii) the sampling design is ignorable with
respect to the estimation model. In this context, an estimator is robust if it stays close to the BLU estimator when both
assumptions hold and if it keeps good properties when one or both assumptions are not fully satisfied. Robustness with
respect to deviations from assumption (i) is called model robustness while robustness with respect to deviations from
assumption (ii) is called design robustness. The Generalized Regression (GREG) estimator is often viewed as being robust 
since its property of being Asymptotically Design Unbiased (ADU) is not dependent on assumptions (i) and (ii). However, 
if both assumptions hold, the GREG estimator may be far less efficient than the BLU estimator and, in that sense, it is not 
robust. The relative inefficiency of the GREG estimator as compared to the BLU estimator is caused by widely dispersed 
design weights. To obtain a design-robust estimator, we thus propose a compromise between the GREG and the BLU 
estimators. This compromise also provides some protection against deviations from assumption (i). However, it does not 
offer any protection against outliers, which can be viewed as a consequence of a model misspecification. To deal with 
outliers, we use the weighted generalized M-estimation technique to reduce the influence of units with large weighted 
population residuals. We propose two practical ways of implementing M-estimators for multipurpose surveys; either the 
weights of influential units are modified and a calibration approach is used to obtain a single set of robust estimation weights 
or the values of influential units are modified. Some properties of the proposed approach are evaluated in a simulation study 
using a skewed finite population created from real survey data. 


KEY WORDS: Design robustness; Model robustness; M-estimator; Outliers; Shrunk weights; Best linear unbiased predictor.


1. INTRODUCTION 


In classical theory, sample data can be viewed as being 
randomly drawn from an infinite population and assump- 
tions are made about the unknown distribution of the infinite 
population. In other words, a model is postulated and the 
interest lies in the estimation of model parameters. In this 
context, an estimator $\hat{\theta}$ of a model parameter $\theta$ is robust if
it stays close to the maximum likelihood estimator of $\theta$
when the model assumptions hold and if it keeps good 
properties when the model assumptions are not fully satis- 
fied. The unknown distribution of the infinite population is 
often assumed to be the normal distribution and, as a result, 
the maximum likelihood estimator reduces to the usual 
least-squares estimator. 

The presence of outliers in the sample can be viewed as a 
consequence of a deviation from a model assumption. The 
majority of the sample could be assumed to come from the 
selected model but some units, called outliers, could be 
thought of as coming from a different model. Therefore, the 
presence of such outliers in the sample may introduce bias 
and increase the variance of the least-squares estimator of 
the selected model parameters. Outliers could also be the 
consequence of a highly skewed distribution. In this case, 
the least-squares estimator is not biased but may be highly 


inefficient due to a deviation from the usual normality 
assumption. The presence of outliers in the sample could 
also be the result of measurement errors. However, it is 
assumed in the rest of this paper that the data have been 
verified and corrected, if necessary, and that there is no 
measurement error left in the data. Outlier-robust estimation 
for infinite populations has been studied extensively (for a 
review, see Huber 1981; or Hampel, Ronchetti, Rousseeuw 
and Stahel 1986). 

In survey sampling theory, the interest usually lies in the
estimation of finite population parameters such as the total,
$t_y = \sum_{k \in U} y_k$, of a variable of interest y for a finite
population U of size N. Because it is usually not possible to
observe the variable y for all population units, the usual
practice consists of selecting from the finite population a
random sample s of size n according to some probability
sampling design $p(s \mid Z)$. The matrix of design information
Z contains N rows with its kth row equal to $z_k'$, and z is a
vector of auxiliary variables available at the design stage.
This does not preclude assuming that the finite population itself
comes from a model, as is explicitly the case
when model-based inferences are made. Under
this type of inference, Royall (1976) derived the Best Linear
Unbiased (BLU) estimator (or predictor) $\hat{t}_y^{B}$ of $t_y$ (see also
Valliant, Dorfman and Royall 2000, Chapter 4). It is based
on the following two assumptions: i) the estimation model
underlying the BLU estimator $\hat{t}_y^{B}$ is correctly specified and
ii) the sampling design is ignorable with respect to the
estimation model. In this context, an estimator $\hat{t}_y$ of the
finite population total $t_y$ is robust if it stays close to the
BLU estimator $\hat{t}_y^{B}$ when both assumptions hold and if it
keeps good properties when one or both assumptions are not
fully satisfied. Robustness with respect to deviations from
assumption (i) is called model robustness while robustness
with respect to deviations from assumption (ii) is called
design robustness.

Although we consider robust estimators that are con- 
structed from a model-based viewpoint, we prefer eval- 
uating their properties as much as possible with respect to 
the sampling design. This allows us to choose the constants 
on which robust estimators depend and to evaluate their 
quality without having to rely on a model and, more 
specifically, without having to rely on a model for the 
outliers. This also provides an objective framework for 
comparing estimators derived under different models. This 
preference of evaluating properties of model-based esti- 
mators with respect to the sampling design is also shared by 
Little (1983) who notes that design-based asymptotics may 
be more useful for assessing estimators than model-based 
asymptotics, particularly when the data set is large. 

The Generalized Regression (GREG) estimator of $t_y$ is
often viewed as being robust since its property of being 
Asymptotically Design Unbiased (ADU) is not dependent 
on assumptions (i) and (ii); that is, the GREG estimator is 
bias-robust even though its form can be justified by an 
estimation model. However, if both assumptions hold, the 
GREG estimator may be far less efficient than the BLU 
estimator and, in that sense, it is not robust. The relative 
inefficiency of the GREG estimator as compared to the 
BLU estimator is caused by widely dispersed design 
weights. The fact that variable design weights may increase 
the variance of an estimator is well known (see, for 
example, Rao 1966; DuMouchel and Duncan 1983; Kish 
1992; Pfeffermann 1993; Korn and Graubard 1999, Chapter 
4; Elliott and Little 2000; and Kalton and Flores-Cervantes 
2003) and is not uncommon in household surveys due to the 
presence of many weight adjustments before calibration 
(Kish 1992; and Kalton and Flores-Cervantes 2003). This 
problem is often treated by truncating the larger design 
weights (Potter 1988, 1990, 1993; and Stokes 1990). 

To obtain a design-robust estimator when the design 
weights are highly variable, we propose a compromise be- 
tween the GREG and the BLU estimators based on the 
weighted Least-Squares (LS) technique. This compromise 
estimator has a smaller design bias than the BLU estimator 
when the ignorability assumption is not satisfied and, at the 
same time, is more efficient than the GREG estimator when 


this assumption holds. It also provides some protection 
against deviations from model assumptions. Balanced 
sampling (Royall and Herson 1973) and nonparametric 
calibration (Chambers, Dorfman and Wehrly 1993) are 
other methods that provide protection against certain types 
of model misspecifications (see also Valliant, Dorfman and 
Royall 2000, Chapters 3, 4 and 11). However, none of these
methods offer any protection against outliers, which can be 
viewed as a consequence of a model misspecification. In a 
model-based framework, the idea underlying the 
M-estimation technique has been proposed to develop 
outlier-robust alternatives to the BLU estimator (Chambers 
1986; Lee 1991; and Welsh and Ronchetti 1998). In a 
design-based framework, the M-estimation technique has 
also been used to develop outlier-robust alternatives to the 
GREG estimator (Gwet and Rivest 1992; Hulliger 1995,
1999; Duchesne 1999; and Zaslavsky, Schenker and Belin
2001). M-estimation is also discussed in the review paper by 
Lee (1995) and an empirical comparison of several outlier- 
robust estimators can be found in Gwet and Lee (2000). 

Finite population parameters are often very sensitive to 
the presence of outliers in the population. This is to be 
contrasted to model (infinite population) parameters, which 
are usually insensitive to outliers. The problem of outlier 
robustness is therefore different for finite and infinite pop- 
ulations. As noted in Chambers (1986), it is the sampling 
error (or the prediction error in a model-based framework) 
of an estimator which must be insensitive to outliers in finite 
populations and not necessarily the estimator itself. For 
instance, when a simple random sampling design is used,
the sample median is robust in the classical sense. As a
result, its design variance is essentially unaffected by the
presence of an outlier in the finite population, no matter how
large that outlier is. However, the sampling error and the
design bias of the sample median, when used as an
estimator of the finite population mean, take an arbitrarily
large value when one or more population units take an
arbitrarily large value. This is explained by the fact that the
finite population mean itself takes an arbitrarily large value
in such a case. Unlike the sample median, the sample mean
is design unbiased but it is not robust in the classical sense.
The sampling error and the design variance of the sample
mean can thus be strongly affected by the presence of an outlier
in the finite population. This illustrates why outlier robustness
for finite populations is often viewed as a trade-off
between bias and variance and why outliers must usually
have an influence, at least to some extent, on estimators.
The Mean Squared Error (MSE) is therefore a useful cri- 
terion for evaluating the quality of outlier-robust estimators 
of finite population parameters. 
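
The bias-variance trade-off described above can be checked on a toy case. The following is a minimal sketch, not taken from the paper: it assumes a hypothetical population of nine units and enumerates every simple random sample (without replacement) of size 3, so that the design bias, design variance and MSE of the sample mean and of the sample median, both used as estimators of the finite population mean, are computed exactly. The population values and the helper name `design_properties` are illustrative assumptions.

```python
from itertools import combinations
from statistics import mean, median


def design_properties(population, n, estimator):
    """Exact design bias, variance and MSE of `estimator` as an estimator of the
    finite population mean, obtained by enumerating all samples of size n."""
    target = mean(population)
    estimates = [estimator(sample) for sample in combinations(population, n)]
    expectation = mean(estimates)
    variance = mean((e - expectation) ** 2 for e in estimates)
    bias = expectation - target
    return bias, variance, bias ** 2 + variance


# Hypothetical populations: the second one replaces the largest unit by an outlier.
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9]
contaminated = clean[:-1] + [1000]

for label, pop in (("clean", clean), ("with outlier", contaminated)):
    for name, est in (("mean", mean), ("median", median)):
        bias, var, mse = design_properties(pop, 3, est)
        print(f"{label:13s} {name:6s} bias={bias:10.3f} var={var:12.3f} mse={mse:12.3f}")
```

In this sketch the design variance of the median is identical in both populations, because the largest unit can never be the median of three values, while its design bias moves with the population mean. The sample mean stays design unbiased but its variance grows sharply.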

The real goal of this paper is to find a robust alternative
to the commonly-used GREG estimator of $t_y$. However, it
is more natural to discuss robustness issues by first
introducing the optimal (BLU) estimator. Therefore, the
assumptions underlying the BLU estimator are discussed in section
2. We also give additional conditions under which the BLU
estimator has a negligible asymptotic design bias. Section 3
deals with design robustness and the weighted LS estimator
is introduced. In section 4, model robustness (more
specifically, outlier robustness) is discussed and the weighted
generalized M-estimation technique is suggested to reduce
the influence of units with large weighted population
residuals. The proposed estimator is census-consistent in the
sense that it is equal to the finite population total $t_y$ when a
census is conducted. We propose two practical ways of
implementing M-estimators for multipurpose surveys; either
the weights of influential units are modified and a
calibration approach is used to obtain a single set of robust
estimation weights or the values of influential units are
modified. Mean Squared Error (MSE) estimation is
discussed in section 5. In section 6, some properties of the
proposed approach are evaluated in a simulation study using
a skewed finite population created from real survey data.
Finally, some concluding remarks are made in the last
section.


2. THE BEST LINEAR UNBIASED ESTIMATOR 


Let us assume that we have a vector of auxiliary
variables x available for all units of the sample s and for
which population totals, $t_x = \sum_{k \in U} x_k$, are known. Let us
also denote by X the matrix containing N rows with its kth
row equal to $x_k'$. The vector x may or may not contain
some variables in the vector z of design variables. Before
discussing robustness, we first describe the two assumptions
(see A1 and A2 below) with respect to which robustness is
desired. Then, we briefly explain how to validate them.


A1) The following estimation model m holds: $y_k$ given X,
for $k \in U$, are independently distributed with mean
$E_m(y_k \mid X) = x_k'\beta$ and variance $V_m(y_k \mid X) = \sigma^2 v_k$,
where $\beta$ and $\sigma^2$ are unknown model parameters,
$v_k = x_k'\lambda$ and $\lambda$ is a vector of known constants. The
subscript “m” indicates that expectations and variances
are evaluated with respect to model m.

A2) The sampling design is independent of y after
conditioning on X; that is, $p(s \mid y, X) = p(s \mid X)$, where y is
a vector containing N elements with its kth element
equal to $y_k$.


Assumption (A1) describes the estimation model m,
which specifies the distribution of y conditional on X.
Standard techniques can be used to validate this model (see,
for example, Draper and Smith 1980, Chapter 3). The
linearity assumption $E_m(y_k \mid X) = x_k'\beta$ is an important
assumption underlying the estimation model m. There are
many ways of assessing the validity of this assumption. A
graph of residuals $e_k = y_k - x_k'\hat{\beta}$ versus $x_k'\hat{\beta}$, for some
m-unbiased estimator $\hat{\beta}$ of $\beta$, is often suggested for this
purpose. Any trend in this graph is an indication that the
relationship between y and x is not linear. To obtain
robustness against a deviation from the linearity assumption, a
poststratification model can be used when it is possible to
partition the population into homogeneous and mutually
exclusive groups. An example of the importance of careful
modeling in sample surveys can be found in Hedlin, Falvey,
Chambers and Kokic (2001).

Assumption (A2) is a sufficient condition for the
ignorability (Rubin 1976) of the sampling design with
respect to the distribution of y conditional on X. In other
words, it means that the distribution of y is independent of s
after conditioning on X. Using assumption (A1), y can be
split into a fixed term $X\beta$ and a random error term
$\varepsilon = y - X\beta$. Consequently, if the sampling design is
independent of $\varepsilon$ after conditioning on X; that is, if
$p(s \mid \varepsilon, X) = p(s \mid X)$, then assumption (A2) is satisfied and
the sampling design is ignorable. Since we only consider
sampling designs of the form $p(s \mid Z)$, an obvious way to
make the sampling design ignorable is to
include all design variables z in the estimation model.
Examples of such design variables may include the
variables used to form the strata, the variable used as a size
measure if probability-proportional-to-size sampling is used
and so on. The design weights may also provide a useful
summary of the design information. Note that it may not be
necessary to include all design variables in the estimation
model (see Sugden and Smith 1984). Design variables that
are independent of y (or $\varepsilon$) after conditioning on X should
not be included. To assess the validity of assumption (A2), a
graph of the residuals, $e_k = y_k - x_k'\hat{\beta}$, versus design weights
$w_k$ (or any design variable) may be useful (see Pfeffermann
1993). Any trend in this graph suggests that the design
weights are correlated with the random error $\varepsilon$ and that the
sampling design is not ignorable with respect to the
estimation model. More formal tests can also be performed
to assess the validity of this assumption (see, for example,
DuMouchel and Duncan 1983; Graubard and Korn 1993;
and, for more references on this topic, Pfeffermann 1993).

Under the estimation model m and the ignorability
assumption (A2), it is easy to show that the BLU estimator
(Royall 1976) $\hat{t}_y^{B}$ of $t_y$ takes the simple projection form
$\hat{t}_y^{B} = t_x'\hat{B}^{B}$, where $\hat{B}^{B}$ is implicitly defined by the equation

$$\sum_{k \in s} (y_k - x_k'\hat{B}^{B}) \frac{x_k}{v_k} = 0. \tag{2.1}$$

The BLU estimator can also be written as $\hat{t}_y^{B} = \sum_{k \in s} w_k^{B} y_k$,
where the BLU estimation weights $w_k^{B}$ are given by

$$w_k^{B} = \frac{x_k'}{v_k} \left( \sum_{k \in s} \frac{x_k x_k'}{v_k} \right)^{-1} t_x. \tag{2.2}$$
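
As a minimal numerical sketch of the projection form, the following Python code (using NumPy) computes the BLU coefficients and estimation weights for a simulated population. Everything about the data is an assumption made for illustration: the population size, the two auxiliary variables (an intercept and a size variable), the choice $\lambda = (0, 1)'$ so that $v_k$ equals the size variable, and the helper name `blu_weights`. It also compares the projection form $t_x'\hat{B}^{B}$ with the prediction form (observed values in s plus predicted values outside s), which should agree when $v_k = x_k'\lambda$.

```python
import numpy as np


def blu_weights(x_s, v_s, t_x):
    """Weights w_k = (x_k'/v_k) (sum_s x_k x_k'/v_k)^{-1} t_x for the sampled units."""
    xv = x_s / v_s[:, None]
    a = xv.T @ x_s                       # sum over s of x_k x_k' / v_k
    return xv @ np.linalg.solve(a, t_x)


rng = np.random.default_rng(0)
N, n = 200, 30                           # hypothetical population and sample sizes
x_pop = np.column_stack([np.ones(N), rng.uniform(1.0, 10.0, N)])
lam = np.array([0.0, 1.0])               # hypothetical lambda, so v_k = x_k' lambda
v_pop = x_pop @ lam
y_pop = x_pop @ np.array([2.0, 3.0]) + rng.normal(0.0, np.sqrt(v_pop))

in_sample = np.zeros(N, dtype=bool)
in_sample[rng.choice(N, n, replace=False)] = True
x_s, v_s, y_s = x_pop[in_sample], v_pop[in_sample], y_pop[in_sample]
t_x = x_pop.sum(axis=0)                  # known population totals of x

# Coefficients solving the estimating equation sum_s (y_k - x_k'B) x_k / v_k = 0
xv = x_s / v_s[:, None]
b_blu = np.linalg.solve(xv.T @ x_s, xv.T @ y_s)

w_blu = blu_weights(x_s, v_s, t_x)
projection = t_x @ b_blu                                    # t_x' B
prediction = y_s.sum() + (x_pop[~in_sample] @ b_blu).sum()  # observed + predicted
print(projection, prediction, w_blu @ y_s)
```

The three printed values should coincide up to rounding, and the weights reproduce the known totals of x.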


The model variance $V_m\{(\hat{t}_y^{B} - t_y) \mid s, X\}$ of $\hat{t}_y^{B}$ is the
smallest for every possible sample among all linear
m-unbiased estimators of $t_y$. A direct consequence of this
result is that the anticipated variance $E_m\{E_p(\hat{t}_y^{B} - t_y)^2 \mid X\}$
of $\hat{t}_y^{B}$ is also the smallest among all linear m-unbiased
estimators of $t_y$, where the subscript p indicates that the
expectation is evaluated with respect to the sampling design.
Under the additional assumption that $y_k$ given X follows a
normal distribution, $\hat{B}^{B}$ is also the maximum likelihood
estimator of the vector of model parameters $\beta$.

In general, the BLU estimator $\hat{t}_y^{B}$ is not ADU. However,
under the estimation model m, the ignorability assumption
(A2) and the following additional assumption (A3), the
BLU estimator has the property of being Asymptotically
Design Unbiased in Probability (ADUP) in the sense that its
relative design bias $E_p(\hat{t}_y^{B} - t_y)/t_y$ converges in
probability to 0 as n and N increase without bound.

A3) $\sum_{k \in U} E_p\{(w_k^{B})^2 I_k\}\sigma_k^2 = O(N)$, $\sum_{k \in U} x_k'\beta = O(N)$ and
$\sum_{k \in U} \sigma_k^2 = O(N)$, where $\sigma_k^2 = \sigma^2 v_k$ and $I_k$ is a dummy
random variable indicating whether unit k is
selected in the sample ($I_k = 1$) or not ($I_k = 0$).


Assumption (A3) describes the asymptotic behaviour of
three population quantities. In particular, requiring that
$\sum_{k \in U} E_p\{(w_k^{B})^2 I_k\}\sigma_k^2 = O(N)$ essentially means that none
of the BLU estimation weights becomes too large as the
sample size and the population size increase. For instance, if
$x_k = v_k = 1$ and if a sampling design of fixed size n is used,
then condition $\sum_{k \in U} E_p\{(w_k^{B})^2 I_k\}\sigma_k^2 = O(N)$ is equivalent
to assuming that the weights $w_k^{B} = N/n$ remain bounded as
both n and N grow. The proof that $\hat{t}_y^{B}$ is ADUP is given in
the appendix and does not require that $v_k = x_k'\lambda$. As a
result, the BLU estimator is ADUP even when the model
variance $V_m(y_k \mid X)$ is misspecified.

As pointed out above, the BLU estimator is efficient 
when the estimation model m and the normality assumption 
hold as well as the ignorability assumption (A2). Under 
these assumptions and the additional assumption (A3), the 
BLU estimator is also ADUP. Consequently, a first step 
towards robustness consists of selecting and validating an 
estimation model such that these assumptions are satisfied 
as much as possible. However, they are rarely fully satisfied 
in practice. For example, one can be reluctant to include all 
strata identifiers into the estimation model when the number 
of strata is very large. In such a case, the ignorability as- 
sumption might not fully hold. Also, the estimation model, 
including the normality assumption, may not hold for every 
variable of interest. Consequently, the uncritical use of the
BLU estimator $\hat{t}_y^{B}$ of $t_y$ is not always appropriate and
robust estimators may be needed.


3. DESIGN ROBUSTNESS 


Using the fact that $v_k = x_k'\lambda$, it can be easily shown (see
Särndal, Swensson and Wretman 1992, page 231) that $t_y$
can be expressed as $t_y = t_x'B$, where B is implicitly defined
by the equation

$$\sum_{k \in U} (y_k - x_k'B) \frac{x_k}{v_k} = 0. \tag{3.1}$$

The vector B would be the LS estimator of $\beta$, under the
estimation model m, if a census could be conducted. Since
$t_x$ is known, the objective of finding an estimator of the
population total $t_y$ is thus equivalent to finding an estimator
of B. In the design-based theory, a natural estimator $\hat{B}^{G}$ of
B is implicitly defined by the equation

$$\sum_{k \in s} w_k (y_k - x_k'\hat{B}^{G}) \frac{x_k}{v_k} = 0, \tag{3.2}$$

where $w_k$, the design weight of unit k, equals the inverse
of the selection probability $\pi_k$. The use of $\hat{B}^{G}$ leads to the
GREG estimator $\hat{t}_y^{G} = t_x'\hat{B}^{G}$ of $t_y$. The GREG estimator
$\hat{t}_y^{G}$ takes a simple projection form because $v_k = x_k'\lambda$ (see
Särndal et al. 1992, page 231). It can also be written as
$\hat{t}_y^{G} = \sum_{k \in s} w_k^{G} y_k$, where the GREG estimation weights $w_k^{G}$
are given by

$$w_k^{G} = w_k \frac{x_k'}{v_k} \left( \sum_{k \in s} w_k \frac{x_k x_k'}{v_k} \right)^{-1} t_x. \tag{3.3}$$


As pointed out in the introduction, the GREG estimator is 
bias-robust since its property of being ADU is not 
dependent on the validity of the estimation model m and the 
ignorability assumption. However, the GREG estimator is 
not variance-robust since it may be far less efficient than the 
BLU estimator when both assumptions hold. The ineffi- 
ciency of the GREG estimator is due to widely dispersed 
design weights. In household surveys, this situation is not 
uncommon because of many weight adjustments before 
calibration. Also, practical considerations for the choice of a 
sampling design combined with limited information avail- 
able at the design stage often lead to sampling designs that 
are approximately ignorable. In household surveys, for in- 
stance, geographic information is often the main auxiliary 
information available to construct the strata. Unless the 
number of strata is very large, such information is usually 
weakly correlated with quantitative variables of interest, 
such as expenditures or income, and their corresponding 
population residual variable $E = y - x'B$. As a result, the
design weight variable w is also weakly correlated with E. 
This suggests that the ignorability assumption may 
approximately hold. This also suggests that the design 
weights act more or less as a random noise when estimating 
B using (3.2) and that their influence could be significantly 
reduced. To obtain a design-robust estimator when the
design weights are highly variable, we thus propose to
shrink the design weights towards their mean and to use the
LS estimator $\hat{t}_y^{LS} = t_x'\hat{B}^{LS}$, where $\hat{B}^{LS}$ is implicitly defined
by

$$\sum_{k \in s} \tilde{w}_k (y_k - x_k'\hat{B}^{LS}) \frac{x_k}{v_k} = 0 \tag{3.4}$$

and where $\tilde{w}_k$ is the shrunk weight of unit k given by

$$\tilde{w}_k = \frac{\sum_{k \in s} w_k}{\sum_{k \in s} g(w_k; \alpha)} \, g(w_k; \alpha). \tag{3.5}$$

The reason for the ratio in the right side of (3.5) is simply
to ensure that $\sum_{k \in s} \tilde{w}_k = \sum_{k \in s} w_k$ and the role of the function
$g(w_k; \alpha)$ is to obtain shrunk weights $\tilde{w}_k$ that are less
variable than the design weights $w_k$. This function is
assumed to be monotone in the constant $\alpha$, with
$1 \le g(w_k; \alpha) \le w_k$. The BLU and GREG estimators are
therefore extreme special cases of the LS estimator obtained
when $\alpha$ is such that $g(w_k; \alpha) = 1$ and $g(w_k; \alpha) = w_k$
respectively. To obtain a simple compromise between these
two extreme estimators, we suggest using $g(w_k; \alpha) = w_k^{\alpha}$,
with $0 \le \alpha \le 1$. The choice $\alpha = 0$ leads to the BLU
estimator while the choice $\alpha = 1$ leads to the GREG
estimator. In fact, this suggestion was proposed by Kish
(1992, page 198). Other functions $g(w_k; \alpha)$ and other ways
of reducing the variability of design weights can be found in
the literature (see, for example, Elliott and Little 2000).
Truncating large design weights ($g(w_k; \alpha) = \min(w_k, \alpha)$,
with $\alpha > 0$) is a common approach that deals with this
problem. This approach may be useful when assumptions
(A1) and (A2) are not fully satisfied and when there are
some abnormally large design weights. A better approach
may be to truncate large weighted residuals. The weighted
generalized M-estimation technique discussed in the next
section can be used for this purpose.

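The LS estimator $\hat{t}_y^{LS}$ can also be written as
$\hat{t}_y^{LS} = \sum_{k \in s} w_k^{LS} y_k$, where the LS estimation weights $w_k^{LS}$
are given by

$$w_k^{LS} = \tilde{w}_k \frac{x_k'}{v_k} \left( \sum_{k \in s} \tilde{w}_k \frac{x_k x_k'}{v_k} \right)^{-1} t_x. \tag{3.6}$$

Note that the estimation weights $w_k^{LS}$, including $w_k^{B}$ and
$w_k^{G}$ as special cases, are calibrated on the known population
totals $t_x$ in the sense that they satisfy the calibration
equation $\sum_{k \in s} w_k^{LS} x_k = t_x$ (see Deville and Särndal 1992).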
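
The shrinkage of the design weights and the resulting calibrated LS weights can be sketched in a few lines of Python (NumPy). This is only an illustration under stated assumptions: the shrinkage function is taken to be the power function $g(w; \alpha) = w^{\alpha}$ suggested in the text, the LS weights are taken to have the same form as the GREG weights with the design weights replaced by the shrunk weights, and the design weights, auxiliary data and totals `t_x` are made-up values. The function names `shrunk_weights` and `ls_weights` are hypothetical.

```python
import numpy as np


def shrunk_weights(w, alpha):
    """Shrunk weights: g(w; alpha) = w**alpha, rescaled to keep the sum of the design weights."""
    g = w ** alpha
    return w.sum() / g.sum() * g


def ls_weights(w, x_s, v_s, t_x, alpha):
    """Calibrated LS estimation weights built from the shrunk weights."""
    w_tilde = shrunk_weights(w, alpha)
    xv = x_s / v_s[:, None]
    a = (w_tilde[:, None] * xv).T @ x_s      # sum_s w~_k x_k x_k' / v_k
    return w_tilde * (xv @ np.linalg.solve(a, t_x))


rng = np.random.default_rng(1)
n = 25
w = rng.uniform(2.0, 40.0, n)                # hypothetical design weights
x_s = np.column_stack([np.ones(n), rng.uniform(1.0, 10.0, n)])
v_s = x_s @ np.array([0.0, 1.0])             # v_k = x_k' lambda with a hypothetical lambda
t_x = np.array([600.0, 3300.0])              # hypothetical known totals of x

for alpha in (0.0, 0.5, 1.0):
    w_ls = ls_weights(w, x_s, v_s, t_x, alpha)
    print(alpha, np.round(w_ls @ x_s, 6), round(float(shrunk_weights(w, alpha).std()), 3))
```

With $\alpha = 0$ the shrunk weights are all equal to the mean design weight (the BLU case), with $\alpha = 1$ they are the design weights themselves (the GREG case), and in every case the weights reproduce `t_x`.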


4. MODEL (OUTLIER) ROBUSTNESS 


As pointed out in the introduction, the LS estimator $\hat{t}_y^{LS}$
provides some protection against deviations from the
ignorability assumption and also against deviations from
model assumptions. However, it does not offer any
protection against outliers, which can be viewed as a
consequence of a model misspecification, including a deviation
from the normality assumption. For instance, the GREG
estimator is ADU no matter the validity of the estimation
model. However, its design variance may be very large in
the presence of outliers in the finite population because they
may greatly influence its sampling error when they are
selected in the sample. This problem may be amplified
when the design weights are widely dispersed. For the
Horvitz-Thompson estimator, this was well illustrated in the
circus example of Basu (1971). Of course, the use of
efficient auxiliary variables at the estimation stage can control
the impact of outliers on estimates. However, such auxiliary
variables are often not available and outlier-robust
estimators may provide significant gains over the LS estimator.

Using the Taylor linearization technique (see, for
example, Särndal et al. 1992, page 235) and given that
$t_y = t_x'B$, it is well known and easy to show that the
sampling error of the GREG estimator can be approximated
as follows: $\hat{t}_y^{G} - t_y \approx \sum_{k \in s} w_k E_k$, where $E_k = y_k - x_k'B$ is
the population residual for unit k. As a result, a large design
weight associated with a large population residual (or
outlier) may have a substantial impact on the quality of the
GREG estimator. Moreover, it is straightforward to show
that the sampling error of the LS estimator can be expressed
as $\hat{t}_y^{LS} - t_y = \sum_{k \in s} w_k^{LS} E_k$. Therefore, a large estimation
weight associated with a large population residual may
greatly influence the sampling error and the quality of the
LS estimator. To deal with this problem, we use the
Schweppe version (Hampel et al. 1986, pages 315-316) of
the weighted generalized M-estimation technique to reduce
the influence of units with large weighted population
residuals. This leads to the M-estimator $\hat{B}^{M}$ of B, which is
implicitly defined by

$$\sum_{k \in s} \tilde{w}_k \frac{1}{h_k} \psi\left\{ \frac{h_k E_k(\hat{B}^{M})}{Q} \right\} \frac{x_k}{\sqrt{v_k}} = 0, \tag{4.1}$$



where $E_k(\hat{B}^{M}) = (y_k - x_k'\hat{B}^{M})/\sqrt{v_k}$, Q is a positive
population scale parameter and $h_k$ is a weight that may
depend not only on $x_k$ but also on $z_k$. The role of the
function $\psi(\cdot)$ consists of reducing the influence of units
with a large $h_k E_k(\hat{B}^{M})$. From the above considerations,
$h_k = w_k^{LS}\sqrt{v_k}$ or $h_k = \tilde{w}_k\sqrt{v_k}$ is a natural choice. In the
former case, the influence of large $w_k^{LS} E_k$ is reduced while,
in the latter case, the influence of large $\tilde{w}_k E_k$ is reduced.
The choice $h_k = w_k^{LS}\sqrt{v_k}$ may be preferred to $h_k = \tilde{w}_k\sqrt{v_k}$
when there are outliers in the auxiliary variables x or when
$\alpha$ is not close to 1 (assuming $g(w_k; \alpha) = w_k^{\alpha}$). The main
point here is that $h_k$ should depend on survey weights $w_k^{LS}$
or $\tilde{w}_k$ and that both choices suggested above should
perform better than simpler choices that do not take into
account the auxiliary variables z such as $h_k = \sqrt{v_k}$ or
$h_k = 1$, which reduce the influence of large unweighted
residuals. Also, it should again be noted that the interest is in
finding a robust estimator for the vector of population
parameters B and not for the vector of model parameters $\beta$.
In fact, B is itself not robust (in the classical sense) for $\beta$
since it may be highly affected by the presence of outliers in
the finite population. As a result, outliers must have a
certain influence on $\hat{B}^{M}$.

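Equation (4.1) can be written in the weighted linear
regression form:

$$\sum_{k \in s} w_k^{*}(\hat{B}^{M}, Q) \, (y_k - x_k'\hat{B}^{M}) \frac{x_k}{v_k} = 0, \tag{4.2}$$

where

$$w_k^{*}(\hat{B}^{M}, Q) = \tilde{w}_k \frac{\psi(r_k)}{r_k}$$

and

$$r_k = \frac{h_k E_k(\hat{B}^{M})}{Q}.$$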

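We propose the following modification of the popular
function $\psi(\cdot)$ of Huber (1964) that makes the adjusted
weights $w_k^{*}(\hat{B}^{M}, Q)$ always greater than or equal to 1:
$\psi(r_k) = r_k$, if $|r_k| \le \varphi$, and $\psi(r_k) = \mathrm{sign}(r_k) \max(|r_k|/\tilde{w}_k, \varphi)$,
otherwise, where $\varphi$ is a positive
constant. This leads to adjusted weights

$$w_k^{*}(\hat{B}^{M}, Q) = \begin{cases} \tilde{w}_k, & \text{if } |r_k| \le \varphi, \\ \max\left(1, \dfrac{\tilde{w}_k \varphi}{|r_k|}\right), & \text{otherwise.} \end{cases} \tag{4.3}$$
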
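
As a small worked illustration of the adjusted weights, the following Python sketch (NumPy) assumes the reading of the rule given just above: a weight is left unchanged when the standardized weighted residual is at most the tuning constant in absolute value, and is otherwise reduced to the larger of 1 and the shrunk weight times the constant divided by the absolute residual. The function name `adjusted_weights`, the argument names and the numbers are hypothetical, and the shrunk weights are assumed to be at least 1.

```python
import numpy as np


def adjusted_weights(w_tilde, r, phi):
    """Adjusted weights: unchanged if |r_k| <= phi, otherwise
    max(1, w_tilde_k * phi / |r_k|), so that they never fall below 1."""
    w_tilde = np.asarray(w_tilde, dtype=float)
    abs_r = np.abs(np.asarray(r, dtype=float))
    out = w_tilde.copy()
    big = abs_r > phi                      # units whose weighted residual is too large
    out[big] = np.maximum(1.0, w_tilde[big] * phi / abs_r[big])
    return out


# Hypothetical shrunk weights and standardized weighted residuals r_k
w_tilde = np.array([10.0, 10.0, 10.0, 1.5])
r = np.array([0.5, -4.0, 100.0, 4.0])
print(adjusted_weights(w_tilde, r, phi=2.0))   # weights 10, 5, 1 and 1
```

The second unit keeps half of its weight, while the third and fourth units would fall below 1 under the ordinary Huber function and are therefore held at 1.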
The Iteratively Reweighted Least-Squares (IRLS)
algorithm (Beaton and Tukey 1974) is often used to solve
(4.2) and (4.3). At a given iteration j, the adjusted weights
$w_k^{*}(\hat{B}^{(j-1)}, \hat{Q}^{(j)})$ are first calculated using (4.3) and then
$\hat{B}^{(j)}$ is obtained by solving (4.2) with $w_k^{*}(\hat{B}^{M}, Q)$ and $\hat{B}^{M}$
replaced by $w_k^{*}(\hat{B}^{(j-1)}, \hat{Q}^{(j)})$ and $\hat{B}^{(j)}$ respectively. To
obtain $\hat{B}^{(j)}$, an estimate of Q is usually calculated at each
iteration of the IRLS algorithm. In the simulation study of
section 6, we have used

$$\hat{Q}^{(j)} = 1.483 \times \text{weighted sample median of } \{ |h_k E_k(\hat{B}^{(j-1)})| ; k \in s \}, \tag{4.4}$$

where the weighted sample median is calculated using the
weights $\tilde{w}_k / h_k$. Equation (4.4) reduces to the proposal of
Hulliger (1999) when $h_k = 1$ and $g(w_k; \alpha) = w_k$. We
suggest using $\hat{B}^{(0)} = \hat{B}^{LS}$ as the vector of starting values since
$\hat{B}^{LS}$ is easy to obtain. The iterative procedure is normally
repeated until convergence is reached. To reduce computer
time, especially if a resampling method is used for MSE
estimation, a single iteration of the IRLS algorithm can be
performed. In section 6, it is shown empirically that
performing a single iteration yields an estimator of the
population total that has properties similar to the fully-iterated
estimator. This point has also been noted by Lee (199?).
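
The iteration described above can be sketched as follows in Python (NumPy). This is a hedged illustration rather than the authors' implementation: it assumes the adjusted-weight rule max(1, shrunk weight times constant divided by |r|) for large residuals, a scale estimate equal to 1.483 times a weighted median of the absolute weighted residuals with median weights equal to the shrunk weights divided by $h_k$, LS starting values, and a simple lower weighted median. The function names, the stopping rule and the simulated data (including the planted outlier) are all assumptions made for the example.

```python
import numpy as np


def weighted_median(values, weights):
    """Lower weighted median: smallest value whose cumulative weight reaches half the total."""
    order = np.argsort(values)
    cum = np.cumsum(weights[order])
    return values[order][np.searchsorted(cum, 0.5 * cum[-1])]


def m_estimate(y, x, v, w_tilde, h, phi, max_iter=50, tol=1e-8):
    """IRLS sketch for the weighted generalized M-estimator of the regression vector."""
    xv = x / v[:, None]

    def solve(wts):
        # Weighted LS solution of sum_s wts_k (y_k - x_k'B) x_k / v_k = 0
        return np.linalg.solve((wts[:, None] * xv).T @ x, (wts[:, None] * xv).T @ y)

    b = solve(w_tilde)                      # starting values: LS estimator
    w_adj = w_tilde.copy()
    for _ in range(max_iter):
        e = (y - x @ b) / np.sqrt(v)        # standardized residuals
        q = 1.483 * weighted_median(np.abs(h * e), w_tilde / h)
        if q <= 0:                          # degenerate scale: stop iterating
            break
        abs_r = np.abs(h * e / q)
        reduced = np.maximum(1.0, w_tilde * phi / np.maximum(abs_r, 1e-300))
        w_adj = np.where(abs_r <= phi, w_tilde, reduced)
        b_new = solve(w_adj)
        converged = np.max(np.abs(b_new - b)) < tol
        b = b_new
        if converged:
            break
    return b, w_adj


rng = np.random.default_rng(2)
n = 40
x = np.column_stack([np.ones(n), rng.uniform(1.0, 10.0, n)])
x[0, 1] = 5.5
v = np.ones(n)
w_tilde = rng.uniform(2.0, 20.0, n)         # hypothetical shrunk weights (all >= 1)
w_tilde[0] = 20.0
y = x @ np.array([1.0, 2.0]) + rng.normal(0.0, 1.0, n)
y[0] += 500.0                               # planted outlier with a large weight
h = w_tilde * np.sqrt(v)                    # one of the choices of h_k discussed in the text

b_m, w_adj = m_estimate(y, x, v, w_tilde, h, phi=3.0)
b_ls, _ = m_estimate(y, x, v, w_tilde, h, phi=np.inf)
print(b_ls, b_m, w_adj[0])
```

Setting the tuning constant to infinity leaves every weight unchanged, so the routine returns the LS solution; with a finite constant the planted outlier has its weight reduced while never dropping below 1.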

The M-estimator of $t_y$ is given by $\hat{t}_y^M = t_x'\hat{B}^M$. With the 
restriction that $\tilde{w}_k(\hat{B}^M, \hat{Q}) \ge 1$, where $\hat{Q}$ is an estimator of 
$Q$, the estimators $\hat{B}^M$ and $\hat{t}_y^M$ are census-consistent in the 
sense that they are exactly equal to $B$ and $t_y$ respectively, 
no matter the value of $\varphi$ and $\alpha$, when a census is conducted 
($\pi_k = 1$, for $k \in U$). This restriction might be useful 
for controlling the design bias of $\hat{t}_y^M$ when there are shrunk 
weights $w_k^{*}$ close to 1. Note that the estimators $\hat{B}^M$ and 
$\hat{t}_y^M$ reduce to $\hat{B}^{LS}$ and $\hat{t}_y^{LS}$ respectively when $\varphi = \infty$. 
The M-estimator $\hat{t}_y^M$ can also be written as 
$\sum_{k \in s} w_k^M y_k$, where the estimation weights $w_k^M$ are 
given by 


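$$w_k^M = \tilde{w}_k(\hat{B}^M, \hat{Q})\,\frac{x_k'}{v_k}\left(\sum_{k \in s} \tilde{w}_k(\hat{B}^M, \hat{Q})\,\frac{x_k x_k'}{v_k}\right)^{-1} t_x. \qquad (4.5)$$

The estimation weights $w_k^M$ are still calibrated on the 
known population totals $t_x$ ($\sum_{k \in s} w_k^M x_k = t_x$). 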
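
As an illustration only, one IRLS iteration and the resulting calibrated estimation weights can be sketched in Python for the special case of a single (scalar) auxiliary variable. This is a minimal sketch and not the authors' implementation: the function names are hypothetical, the residual is assumed to be $e_k(B) = (y_k - x_k B)/\sqrt{v_k}$, and the sketch assumes $h_k > 0$ and a non-zero scale estimate.

```python
import math


def weighted_median(values, weights):
    """Return the smallest value whose cumulative weight reaches half the total."""
    pairs = sorted(zip(values, weights))
    half = 0.5 * sum(weights)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    return pairs[-1][0]


def irls_step(y, x, v, w_star, h, beta_prev, phi):
    """One IRLS iteration for a scalar auxiliary variable x (illustrative only)."""
    # h_k * e_k(beta_prev), assuming e_k = (y_k - x_k * beta) / sqrt(v_k).
    he = [hk * (yk - xk * beta_prev) / math.sqrt(vk)
          for yk, xk, vk, hk in zip(y, x, v, h)]
    # Scale estimate in the spirit of (4.4); the median uses weights w*_k / h_k.
    scale = 1.483 * weighted_median([abs(r) for r in he],
                                    [wk / hk for wk, hk in zip(w_star, h)])
    # Adjusted weights as in (4.3): unchanged if |r_k| <= phi,
    # otherwise shrunk to max(1, w*_k * phi / |r_k|).
    w_tilde = []
    for wk, r in zip(w_star, he):
        r_abs = abs(r) / scale
        w_tilde.append(wk if r_abs <= phi else max(1.0, wk * phi / r_abs))
    # Weighted least squares as in (4.2), using the adjusted weights.
    denom = sum(wt * xk * xk / vk for wt, xk, vk in zip(w_tilde, x, v))
    beta = sum(wt * xk * yk / vk
               for wt, xk, yk, vk in zip(w_tilde, x, y, v)) / denom
    return beta, w_tilde


def estimation_weights(x, v, w_tilde, t_x):
    """Calibrated estimation weights as in (4.5), scalar-x case."""
    denom = sum(wt * xk * xk / vk for wt, xk, vk in zip(w_tilde, x, v))
    return [wt * xk / vk * t_x / denom for wt, xk, vk in zip(w_tilde, x, v)]
```

With the returned weights, the sum of `w_m[k] * x[k]` reproduces `t_x` and the sum of `w_m[k] * y[k]` equals `t_x * beta`, which mirrors the calibration property stated above.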

In order to determine appropriate values for $\alpha$ and $\varphi$, 
the MSE of the M-estimator $\hat{t}_y^M$ can be estimated for 
different choices of $\alpha$ and $\varphi$ using past or current sample 
data. Then, the values of $\alpha$ and $\varphi$ that give the smallest 
estimated MSE can be chosen. Estimation of MSE is discussed 
in section 5. As noted in Hulliger (1995), choosing 
$\alpha$ and $\varphi$ adaptively by minimizing the estimated MSE 
with current sample data leads to an estimator $\hat{t}_y^M$ that does 
not require estimating the scale parameter $Q$. Also, this 
procedure controls the magnitude of the design bias of $\hat{t}_y^M$ 
without requiring the use of additional constants. However, 
it is likely to provide less efficiency than using the optimal 
(although unknown) values of $\alpha$ and $\varphi$. 
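
The adaptive choice can be illustrated with a small Python sketch restricted to the LS estimator (that is, $\varphi = \infty$) and a scalar auxiliary variable. The sketch rests on several assumptions that are not part of the text above: weights are shrunk with $g(w_k; \alpha) = w_k^{\alpha}$, the MSE is estimated with a simplified estimator of the type given as (5.3) in section 5, the sample is a Poisson sample, and the estimator with $\alpha = 1$ serves as the approximately unbiased reference. All function names are hypothetical.

```python
def ls_fit(y, x, v, w, alpha, t_x):
    """LS estimator for scalar x with shrunk weights w_k ** alpha (assumed form of g)."""
    w_star = [wk ** alpha for wk in w]
    denom = sum(ws * xk * xk / vk for ws, xk, vk in zip(w_star, x, v))
    beta = sum(ws * xk * yk / vk
               for ws, xk, yk, vk in zip(w_star, x, y, v)) / denom
    # Calibrated estimation weights, so that sum(w_ls * x) equals t_x.
    w_ls = [ws * xk / vk * t_x / denom for ws, xk, vk in zip(w_star, x, v)]
    return t_x * beta, beta, w_ls


def estimated_mse(y, x, v, w, pi, alpha, t_x):
    """Simplified MSE estimate: Poisson-sampling variance term plus squared difference."""
    t_ls, beta, w_ls = ls_fit(y, x, v, w, alpha, t_x)
    t_ref, _, _ = ls_fit(y, x, v, w, 1.0, t_x)  # alpha = 1 used as reference
    # Under Poisson sampling only the diagonal terms (1 - pi_k) remain.
    variance = sum((1.0 - pk) * (wk * (yk - xk * beta)) ** 2
                   for pk, wk, yk, xk in zip(pi, w_ls, y, x))
    return variance + (t_ls - t_ref) ** 2


def choose_alpha(y, x, v, w, pi, t_x, grid):
    """Return the grid value of alpha with the smallest estimated MSE."""
    return min(grid, key=lambda a: estimated_mse(y, x, v, w, pi, a, t_x))
```

The same grid search extends to $\varphi$ by looping over pairs of candidate values, at the cost of recomputing the M-estimator for each pair.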

In multipurpose surveys, different values of $\alpha$ and $\varphi$ are 
likely to be obtained for different variables of interest. If 
multiple sets of weights are to be avoided, some form of 
compromise is needed. As a first step towards a compromise, 
a common value of $\alpha$, satisfactory for the most 
important variables of interest, can be determined. Then, we 
propose two practical ways of implementing the 
M-estimator $\hat{t}_y^M$ without having to find a compromise value 
for $\varphi$: either the weights of influential units are modified 
and a calibration approach is used to obtain a single set of 
robust estimation weights, or the values of influential units 
are modified. The former is discussed in section 4.1 while 
the latter is discussed in section 4.2. 


4.1 Modification of the Weights of Influential Units 


Let us now assume that it is desired to estimate the 
population totals of a vector of $q$ variables of interest 
$y = (y_1, y_2, \ldots, y_q)'$. A vector of $q$ M-estimators $\hat{t}_y^M = 
(\hat{t}_{y_1}^M, \hat{t}_{y_2}^M, \ldots, \hat{t}_{y_q}^M)'$ of $t_y = \sum_{k \in U} y_k$ can be obtained, with 
potentially different values of $\varphi$ for different variables. To 
simplify the notation, we denote the adjusted weights 
associated with variable $y_i$ by $\tilde{w}_k(y_i)$, for $i = 1, 2, \ldots, q$. 
Since the adjusted weights $\tilde{w}_k(y_i)$ depend on the variable 
of interest $y_i$, we obtain $q$ sets of weights, even if a 
common value of $\varphi$ is chosen. 

Gwet and Rivest (1992), Duchesne (1999) and Hulliger 
(1999) suggested using the adjusted weights $\tilde{w}_k(y) = 
\min(\tilde{w}_k(y_1), \tilde{w}_k(y_2), \ldots, \tilde{w}_k(y_q))$ to obtain a unique set of 
weights. Then, estimation weights $w_k^M(y)$ are calculated by 
replacing $\tilde{w}_k(\hat{B}^M, \hat{Q})$ by $\tilde{w}_k(y)$ in (4.5) and $t_y$ is 
estimated by $\sum_{k \in s} w_k^M(y)\,y_k$. Although the estimation 
weights $w_k^M(y)$ are calibrated on the known population 
totals $t_x$, they are not calibrated on the vector of estimates 
$\hat{t}_y^M$, which are believed to be our best estimates in the sense 
of minimizing the estimated MSE. Moreover, the use of 
$\sum_{k \in s} w_k^M(y)\,y_k$ likely leads to a larger design bias than $\hat{t}_y^M$ 
although it controls the design variance. To cope with these 
issues, we propose computing the estimation weights 
$w_k^{MA}(y)$ by replacing $\tilde{w}_k(\hat{B}^M, \hat{Q})$ by the adjusted weights 
$\tilde{w}_k(y)$ in (4.5), and by augmenting the vector of auxiliary 
variables $x$ and the known population totals $t_x$ using $y$ 
and $\hat{t}_y^M$ respectively. As a result, the estimation weights 
$w_k^{MA}(y)$ are calibrated on $t_x$ and $\hat{t}_y^M$, and $t_y$ is estimated 
by $\sum_{k \in s} w_k^{MA}(y)\,y_k = \hat{t}_y^M$. Of course, there may be a limit 
on the number of variables that can be used for calibration 
purposes. This may somewhat restrict the applicability of 
this method when $q$ is very large. 


4.2 Modification of the Values of Influential Units 


Another way of implementing the M-estimator $\hat{t}_y^M$ in 
practice consists of modifying the values of the variables of 
interest $y$ and using the LS estimation weights $w_k^{LS}$ for all 
variables. This can be done separately for each variable of 
interest, so we return to the case of only one variable of 
interest in this section. 

Let us first denote by $s_I$ the random set of all sample 
units $k$ for which $\tilde{w}_k(\hat{B}^M, \hat{Q}) \ne w_k^{*}$. In other words, $s_I$ is 
the random set of units that have been detected as being 
influential. Let also $\hat{B}^{M*}$ be implicitly defined by the 
equation 

$$\sum_{k \in s} w_k^{*}\,\frac{x_k}{v_k}\,(y_{\bullet k} - x_k'\hat{B}^{M*}) = 0, \qquad (4.6)$$

where $y_{\bullet k} = y_k$, if $k \in s - s_I$, and $y_{\bullet k} = y_k^{*}$, otherwise. The 
quantity $y_k^{*}$ is a modified value for the influential unit $k$ that 
is used to replace $y_k$. Note that $\hat{B}^{M*} = \hat{B}^{LS}$ if $y_{\bullet k} = y_k$, for 
$k \in s$. The population total $t_y$ can then be estimated by 
$\hat{t}_y^{M*} = t_x'\hat{B}^{M*}$. It is also easy to show that $\hat{t}_y^{M*} = 
\sum_{k \in s} w_k^{LS}\,y_{\bullet k}$. 

The idea here consists of finding modified values $y_k^{*}$, for 
$k \in s_I$, as close as possible to the original values $y_k$ and 
that satisfy the constraint $\hat{B}^{M*} = \hat{B}^M$. Under this constraint, 
it is obvious that $\hat{t}_y^{M*} = \hat{t}_y^M$. A possible implementation of 
this idea is obtained by minimizing the distance function 
$\sum_{k \in s_I} w_k^{*}(y_k^{*} - y_k)^2/v_k$ subject to the constraint $\hat{B}^{M*} = 
\hat{B}^M$. This leads to the modified values 

$$y_k^{*} = y_k - x_k'\left(\sum_{k \in s_I} w_k^{*}\,\frac{x_k x_k'}{v_k}\right)^{-1} \sum_{k \in s} w_k^{*}\,\frac{x_k}{v_k}\,(y_k - x_k'\hat{B}^M). \qquad (4.7)$$


This idea is essentially equivalent to reverse calibration 
proposed by Ren and Chambers (2002), except that these 
authors used the constraint $\hat{t}_y^{M*} = \hat{t}_y^M$ instead of $\hat{B}^{M*} = \hat{B}^M$. 
We prefer the latter since it leads to modified values that 
better preserve the relationships between the variable of 
interest y and the auxiliary variables x. 

Other ways of determining modified values that satisfy 
the constraint $\hat{B}^{M*} = \hat{B}^M$ can be found. For example, it is 
straightforward to show that this constraint is satisfied when 
the following modified values are used: 

$$y_k^{*} = a_k y_k + (1 - a_k)\,x_k'\hat{B}^M, \qquad (4.8)$$

where $a_k = \tilde{w}_k(\hat{B}^M, \hat{Q})/w_k^{*}$. The modified values in 
equation (4.8) have a simple interpretation: they are a 
weighted average of the robust prediction $x_k'\hat{B}^M$ and the 
observed value $y_k$. Less weight is given to the observed 
value $y_k$ when it has a smaller value of $a_k$ and, therefore, 
when it is highly influential. 
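
The weighted-average form of the modified values lends itself to a short numerical check. The Python sketch below, written for a scalar auxiliary variable and using hypothetical function names, computes modified values of the form $a_k y_k + (1 - a_k) x_k \hat{B}^M$ with $a_k$ taken as the ratio of the adjusted weight to the unadjusted weight. It is only an illustration under those assumptions.

```python
def wls_slope(y, x, v, w):
    """Solve sum_k w_k * x_k * (y_k - x_k * B) / v_k = 0 for a scalar B."""
    num = sum(wk * xk * yk / vk for wk, xk, yk, vk in zip(w, x, y, v))
    den = sum(wk * xk * xk / vk for wk, xk, vk in zip(w, x, v))
    return num / den


def modified_values(y, x, w_star, w_tilde, beta_m):
    """Weighted average of observed value and robust prediction for each unit."""
    values = []
    for yk, xk, ws, wt in zip(y, x, w_star, w_tilde):
        a = wt / ws  # a_k = 1 for units whose weight was left unchanged
        values.append(a * yk + (1.0 - a) * xk * beta_m)
    return values
```

Fitting the unadjusted weighted regression to the modified values returns the robust coefficient again, so the unadjusted weights can be kept for every variable while the influential observations are pulled towards their robust predictions.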


5. MEAN SQUARED ERROR ESTIMATION 


Estimation of the MSE of $\hat{t}_y^M$ can be used for three 
different purposes: i) finding appropriate values for $\alpha$ and 
$\varphi$ using past or current sample data, ii) evaluating the quality 
of estimates and iii) making inferences about unknown 
population quantities. Using the fact that $E_p(\hat{t}_y^G) \approx t_y$, it 
can be easily shown that the MSE of $\hat{t}_y^M$ can be approximated by 

$$MSE_p(\hat{t}_y^M) \approx V_p(\hat{t}_y^M) + E_p(\hat{t}_y^M - \hat{t}_y^G)^2 - V_p(\hat{t}_y^M - \hat{t}_y^G). \qquad (5.1)$$

The last two terms of (5.1) are equal to $[E_p(\hat{t}_y^M - \hat{t}_y^G)]^2$. 
They represent the square of the design bias of $\hat{t}_y^M$. As 
suggested in Gwet and Rivest (1992), a potential estimator 
of $MSE_p(\hat{t}_y^M)$ is given by 

$$mse_p(\hat{t}_y^M) = \hat{V}_p(\hat{t}_y^M) + \max\{0,\; (\hat{t}_y^M - \hat{t}_y^G)^2 - \hat{V}_p(\hat{t}_y^M - \hat{t}_y^G)\}, \qquad (5.2)$$

where $\hat{V}_p(\hat{t}_y^M)$ and $\hat{V}_p(\hat{t}_y^M - \hat{t}_y^G)$ are estimators of 
$V_p(\hat{t}_y^M)$ and $V_p(\hat{t}_y^M - \hat{t}_y^G)$ respectively. 


Since the M-estimator $\hat{t}_y^M$ has a complex structure, resampling 
variance estimation methods are a convenient 
way of estimating $V_p(\hat{t}_y^M)$ and $V_p(\hat{t}_y^M - \hat{t}_y^G)$. The jackknife, 
the bootstrap and the balanced repeated replication 
methods are described and evaluated in Rao, Wu and Yue 
(1992) for stratified multistage sampling designs, where the 
primary sampling units are assumed to have been selected 
with replacement. They have shown in an empirical study 
that the jackknife variance estimator can have a large bias 
when estimating the variance of a non-smooth estimator, 
such as the sample median. Therefore, the jackknife variance 
estimator might be more biased for estimating the variance 
of the M-estimator than the balanced repeated replication 
or the bootstrap method when, at each iteration of 
the IRLS algorithm, $Q$ is estimated using a non-smooth 
estimator such as (4.4). Gwet and Lee (2000) studied empirically 
the performance of the jackknife and the bootstrap 
methods for some robust estimators. In general, they found 
encouraging results. It is important to note that the estimator 
$\hat{t}_y^M$ should be recomputed for each resample. This includes 
repeating the procedure used to estimate $\alpha$ and $\varphi$ if they 
are estimated using current sample data. 
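
To make the bookkeeping concrete, the following Python sketch combines replicate estimates into an MSE estimate of the variance-plus-squared-bias type discussed above. It is a sketch under stated assumptions rather than a prescribed procedure: the replicate estimates are assumed to have been produced by some resampling method with the robust estimator recomputed on every resample, the squared-bias term is truncated at zero, the replicate variance uses the divisor B - 1, and the function names are hypothetical.

```python
def replicate_variance(replicates):
    """Sample variance of replicate estimates around their mean (divisor B - 1)."""
    b = len(replicates)
    mean = sum(replicates) / b
    return sum((r - mean) ** 2 for r in replicates) / (b - 1)


def mse_estimate(t_m, t_ref, var_m, var_diff):
    """Variance estimate plus squared-bias estimate, truncated at zero."""
    return var_m + max(0.0, (t_m - t_ref) ** 2 - var_diff)


def mse_from_replicates(t_m, t_ref, reps_m, reps_ref):
    """Build both variance estimates from paired replicate estimates."""
    var_m = replicate_variance(reps_m)
    # The difference must be formed within each resample before taking the variance.
    var_diff = replicate_variance([a - b for a, b in zip(reps_m, reps_ref)])
    return mse_estimate(t_m, t_ref, var_m, var_diff)
```

Here `t_ref` stands for the approximately design-unbiased reference estimate and `t_m` for the robust estimate computed on the full sample.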

When the goal of MSE estimation is only to find 
appropriate values for $\alpha$ and $\varphi$, it may be convenient to 
consider simplified MSE estimators in order to reduce 
computer time. We now propose four different ways of 
simplifying MSE estimation: 

i) Only a single iteration of the IRLS algorithm could be 
done for each resample even if a fully-iterated 
M-estimator is used. This might yield reasonable 
variance estimates since the singly-iterated and fully- 
iterated M-estimators seem to have similar properties 
(see section 6.4). 


ii) Some quantities could be assumed fixed (not random) 
for MSE estimation. This is likely to lead to an 
underestimation of the MSE but it may be useful if the 
goal of MSE estimation is only to find appropriate 
values for $\alpha$ and $\varphi$. For example, the adjusted weights 
$\tilde{w}_k(\hat{B}^M, \hat{Q})$ could be assumed fixed. This approximation 
was in fact suggested in Hulliger (1999). Alternatively, 
if the M-estimator is implemented using the 
methodology in section 4.2, the modified values in 
(4.7) or (4.8) could be treated as true values for MSE 
estimation. 

iii) The term $\hat{V}_p(\hat{t}_y^M - \hat{t}_y^G)$ in (5.2) could be omitted. This 
would lead to the MSE estimator: $mse_p(\hat{t}_y^M) = 
\hat{V}_p(\hat{t}_y^M) + (\hat{t}_y^M - \hat{t}_y^G)^2$. Note that this approach leads to 
an overestimation of the MSE. 


iv) A combination of two of the above three propositions 
could be considered. For example, the adjusted weights 
$\tilde{w}_k(\hat{B}^M, \hat{Q})$ could be assumed fixed and the term 
$\hat{V}_p(\hat{t}_y^M - \hat{t}_y^G)$ in (5.2) could be omitted. In such a case, 
an estimator for $V_p(\hat{t}_y^M)$ could be obtained by noting 
that $V_p(\hat{t}_y^M) = t_x' V_p(\hat{B}^M)\,t_x$ and by using the well 
known Taylor linearization technique of Binder (1983) 
to estimate $V_p(\hat{B}^M)$. After some straightforward algebra, 
we obtain the MSE estimator 

$$mse_p(\hat{t}_y^M) = \sum_{k \in s}\sum_{l \in s} \frac{(\pi_{kl} - \pi_k \pi_l)}{\pi_{kl}}\, w_k^M (y_k - x_k'\hat{B}^M)\, w_l^M (y_l - x_l'\hat{B}^M) + (\hat{t}_y^M - \hat{t}_y^G)^2, \qquad (5.3)$$

where $\pi_{kl}$ is the joint probability of selection of units $k$ 
and $l$. 
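
A direct transcription of a double-sum estimator of this type is sketched below in Python, together with the joint selection probabilities for the special case of Poisson sampling, where units are drawn independently so that the joint probability of two distinct units is the product of their individual probabilities. The function names, the argument `fitted` (holding the fitted values for each sampled unit) and the argument `t_ref` (the approximately unbiased reference estimate) are hypothetical and chosen for illustration only.

```python
def poisson_joint_probabilities(pi):
    """Joint selection probabilities under Poisson sampling (independent draws)."""
    n = len(pi)
    return [[pi[k] if k == j else pi[k] * pi[j] for j in range(n)]
            for k in range(n)]


def simplified_mse(y, fitted, w_m, pi, pi_joint, t_m, t_ref):
    """Double-sum variance term plus squared difference from a reference estimate."""
    # Weighted residuals w_k * (y_k - fitted_k) for each sampled unit.
    z = [wk * (yk - fk) for wk, yk, fk in zip(w_m, y, fitted)]
    n = len(z)
    variance = 0.0
    for k in range(n):
        for j in range(n):
            factor = (pi_joint[k][j] - pi[k] * pi[j]) / pi_joint[k][j]
            variance += factor * z[k] * z[j]
    return variance + (t_m - t_ref) ** 2
```

Under Poisson sampling the off-diagonal factors vanish, so the variance term collapses to the sum of $(1 - \pi_k)$ times the squared weighted residuals, which is much cheaper to compute than the general double sum.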


6. SIMULATION STUDY 


We performed a simulation study to evaluate some 
properties of the LS estimator and the M-estimator for a 
skewed finite population. In particular, we compared a 
version of the M-estimator that reduces the influence of 
large weighted population residuals to another one that 
reduces the influence of large unweighted population 
residuals. We also compared the performance of the singly- 
and fully-iterated M-estimators. Section 6.1 describes the 
population and the sampling design, and sections 6.2 to 6.4 
discuss results from the simulation. 


6.1 Population and Sampling Design 


The data from Statistics Canada’s 1998 Survey of 
Household Spending (SHS) are used to serve as the population. 
This survey uses a stratified multi-stage design and 
contains information about 15,457 households on several 
variables. The variable Renovation/Repair is chosen as the 
variable of interest $y$. This variable is considered for its 
greater potential of having very large values. A vector $x$ of 
three binary auxiliary variables has been created by 
dividing the variable Income into three categories (Income < 
30,000, 30,000 < Income < 60,000 and Income > 60,000) 
and we have chosen $v_k = 1$, for all $k \in U$. In other words, 
we have considered a poststratification estimation model, 
which should give us robustness against deviations from the 
linearity assumption. The population coefficient of 
determination ($R^2$) for this estimation model is 0.13. This 
is a typical $R^2$ in household surveys. 

From this population, 5,000 samples of expected sample 
size 300 have been selected using Poisson sampling. We 
wanted to give households quite dispersed probabilities of 
selection resulting in variable design weights. We thus 
assigned probabilities of selection such that they were 
proportional to the inverse of the SHS design weights 
(which include a nonresponse adjustment factor). The 
selection probabilities are thus given by $\pi_k = 
(300/\sum_{k \in U} \tilde{\pi}_k)\,\tilde{\pi}_k$, where $\tilde{\pi}_k$, for $k \in U$, is the reciprocal 
of the design weight (including a nonresponse adjustment 
factor) from the SHS data. 
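
The sample selection step can be sketched in a few lines of Python. The sketch assumes that a size measure is available for every population unit (here, the reciprocal of an existing design weight) and that no resulting probability exceeds 1; the function names and the seeded generator are illustrative choices, not part of the study.

```python
import random


def selection_probabilities(size_measure, expected_n):
    """Probabilities proportional to a size measure, summing to expected_n."""
    total = sum(size_measure)
    return [expected_n * s / total for s in size_measure]


def poisson_sample(pi, rng):
    """Select each unit independently with its own probability pi_k."""
    return [k for k, p in enumerate(pi) if rng.random() < p]


# Illustrative use with made-up design weights for a population of 200 units.
design_weights = [10.0, 20.0, 40.0, 80.0] * 50
pi = selection_probabilities([1.0 / w for w in design_weights], expected_n=30)
sample = poisson_sample(pi, random.Random(2004))
new_weights = [1.0 / pi[k] for k in sample]  # design weights of the new sample
```

Because each unit is drawn independently, the realized sample size is random and only its expectation equals the target.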

Table 6.1 gives some summary statistics for this population. 
We note that the population residuals are very 
skewed and that the skewness increases when the residuals 
are multiplied by the design weights. Figure 6.1 shows a 
graph of the population residuals versus the design weights. 
First, we note that there is a clear outlier with a residual 
greater than 50,000 and with a design weight not close to 1. 
Fortunately, the most extreme design weights are not associated 
with large population residuals. Also, although this 
graph may be misleading because of the huge number of 
points that are overlapping, there does not seem to be any 
clear relationship between the population residuals and the 
design weights. In fact, the coefficient of correlation between 
the design weights and the population residuals is 
0.0049. Such a small coefficient of correlation is not atypical 
in household surveys, for reasons discussed in section 3, 
and suggests that the ignorability assumption may hold 
approximately. 


Table 6.1 
Summary Statistics about the Population 

Variable                        Mean          Standard Deviation   Skewness 
Renovation/Repair               367           1,124                12.6 
Population Residual             0             1,104                12.8 
Design Weight                   [illegible]   170                  1.8 
Weighted Population Residual    922           295,685              15.0 

Figure 6.1. Graph of the population residuals versus the design 
weights 

For each of the 5,000 samples, estimates of the 
population total for the Renovation/Repair variable have 
been calculated for both the LS estimator and two versions 
of the M-estimator: one that reduces the influence of large 
weighted population residuals ($h_k = w_k$) and another one 
that reduces the influence of large unweighted population 
residuals ($h_k = 1$). For the $r$th sample, the relative error in 
percentage of any estimate $\hat{t}_y^{(r)}$ of $t_y$ is defined as 
$\Delta_r = 100\% \times (\hat{t}_y^{(r)} - t_y)/t_y$. The Relative Bias (RB) and the 
Relative Root Mean Squared Error (RRMSE) of any 
estimator $\hat{t}_y$, expressed as a percentage of the population 
total, can thus be estimated by $RB = \sum_{r=1}^{5,000} \Delta_r/5,000$ and 
$RRMSE = \sqrt{\sum_{r=1}^{5,000} \Delta_r^2/5,000}$ respectively. Another measure 
of interest is the Maximum Absolute Relative Error 
(MARE) in percentage given by $MARE = 
\max(|\Delta_r|;\ r = 1, 2, \ldots, 5,000)$. This measure may be 
useful to assess the sensitivity of an estimator to the 
presence of influential units in the sample. 
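
These three summary measures are simple to compute once the estimates from all simulated samples are stored. The Python sketch below uses hypothetical function names and works for any number of replicates; it is included only to make the definitions concrete.

```python
import math


def relative_errors(estimates, true_total):
    """Relative error, in percent, of each simulated estimate."""
    return [100.0 * (est - true_total) / true_total for est in estimates]


def simulation_summary(estimates, true_total):
    """Return (RB, RRMSE, MARE), all in percent of the true total."""
    deltas = relative_errors(estimates, true_total)
    n = len(deltas)
    rb = sum(deltas) / n
    rrmse = math.sqrt(sum(d * d for d in deltas) / n)
    mare = max(abs(d) for d in deltas)
    return rb, rrmse, mare
```

For example, `simulation_summary([90, 110, 120], 100)` is based on relative errors of -10, 10 and 20 percent, so the MARE is 20 percent while the RB is the mean of those three values.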


6.2 The LS Estimator: Design Robustness 


In this section, we evaluate the properties of the LS 
estimator. Figure 6.2 illustrates the RB, RRMSE and 
MARE of the LS estimator for 11 values of $\alpha$ ($\alpha$ = 0, 0.1, 
0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1) when $g(w_k; \alpha) = w_k^{\alpha}$. 
On the one hand, the BLU estimator ($\alpha = 0$) has an 
RRMSE close to the minimum and the smallest MARE 
among these 11 values of $\alpha$ but, as expected, leads to the 
largest RB (in absolute value). Its RB is equal to -13.05%, 
which is not negligible. Given that a poststratification model 
is used, this suggests that the ignorability assumption is not 
fully satisfied even though the correlation between the 
design weights and the population residuals is small. On the 
other hand, the GREG estimator ($\alpha = 1$) has a very small 
RB but has the largest RRMSE and MARE due to the 
variability of the design weights. When $\alpha = 0.2$, the LS 
estimator is biased, with an RB of -9.11%, but has a value 
of MARE relatively close to the smallest value and has the 
smallest RRMSE (17.94%) among the values of $\alpha$ 
considered. This is a substantial reduction in comparison 
with the RRMSE of the GREG estimator (34.77%). In 
general, values of $\alpha$ between 0.2 and 0.5 provide a 
reasonable compromise estimator with respect to RB, 
RRMSE, and MARE. Note that, for larger expected sample 
sizes, we expect the minimum MSE to be reached for 
larger values of $\alpha$ because the bias of the LS estimator may 
dominate its variance. 

We have also considered the LS estimator obtained by 
choosing adaptively, for each selected sample, the value of 
$\alpha$ that leads to the smallest estimated MSE among the set of 
11 values of $\alpha$ considered above. The MSE has been 
estimated using equation (5.3). The average value of $\alpha$ over 
the 5,000 selected samples is 0.43. This is slightly larger 
than the value of $\alpha$ (0.2) that leads to the smallest MSE (see 
figure 6.2). This may be due to the simplification made to 
obtain (5.3), which omits a component of the squared design 
bias when estimating the MSE. Nevertheless, this LS 
estimator shows a significant improvement over the GREG 
estimator in terms of RRMSE (26.05%) and MARE 
(217.99%). This LS estimator also shows a significant 
improvement over the BLU estimator in terms of RB 
(-6.24%). Therefore, it seems that choosing the 
value of $\alpha$ adaptively leads to a useful compromise between the 
GREG and BLU estimators. However, there is a price to 
pay in terms of RRMSE by estimating $\alpha$ instead of using 
the optimal (although unknown) value of $\alpha$. 

Figure 6.2. RB, RRMSE and MARE of the LS estimator 


6.3 The M-estimator: Outlier robustness 


We have compared two versions of the M-estimator: one 
that reduces the influence of large weighted population 
residuals ($h_k = w_k$) and another one that reduces the 
influence of large unweighted population residuals 
($h_k = 1$). For the weighted version, we chose 7 values of $\varphi$ 
($\varphi$ = 10, 25, 50, 100, 150, 200, $\infty$) and for the unweighted 
version, we chose 9 values of $\varphi$ ($\varphi$ = 2, 5, 8, 11, 14, 17, 20, 
30, $\infty$). We have only considered the case $\alpha = 1$, as we did 
not want to confound the effect of changing the constant $\alpha$ 
with the effect of changing the constant $\varphi$. Of course, a 
more efficient estimator could be found by an appropriate 
choice of both constants. It is to be noted that the results are 
based on a single iteration of the IRLS algorithm using 
$\hat{B}^{(0)} = \hat{B}^{LS}$ as the vector of starting values. 

It can be seen from figures 6.3 and 6.4 that the weighted 
version ($h_k = w_k$) has a better potential for reducing the 
RRMSE and the MARE of M-estimators than the 
unweighted version ($h_k = 1$). Both graphs of RRMSE 
present a U-shaped curve. The RRMSE curve for $h_k = w_k$ 
shows that a value of $\varphi$ between 50 and 150 leads to an 
RRMSE between 25% and about 27%, while the RRMSE 
of the GREG estimator (last point on the graphs) is equal to 
34.77%. The RRMSE curve for $h_k = 1$ shows that the 
RRMSE is around 30% for values of $\varphi$ between 8 and 20. 
In the area where the RRMSE is close to its minimum 
value, the MARE is smaller when $h_k = w_k$. This suggests 
that $h_k = w_k$ may control influential units better than 
$h_k = 1$. As expected, the RB in both figures decreases as $\varphi$ 
increases. 

We have also considered the weighted and unweighted 
versions of the M-estimator obtained by choosing 
adaptively, for each selected sample, the value of $\varphi$ that 
leads to the smallest estimated MSE (using equation 5.3) 
among the sets of values of $\varphi$ considered above. The 
average value of $\varphi$ over the selected samples is 72.34 for 
the weighted version and 10.58 for the unweighted version. 
Calculation of these averages excludes samples for which 
$\varphi = \infty$ (13 samples for $h_k = w_k$ and 1 sample for $h_k = 1$). 
Both averages are close to the optimal values of $\varphi$ found in 
figures 6.3 and 6.4 (100 for $h_k = w_k$, and 11 for $h_k = 1$). 
The weighted version of the M-estimator has an RB of 
-10.24%, RRMSE of 28.07% and MARE of 197.86%. 
The unweighted version of the M-estimator has an RB of 
-8.26%, RRMSE of 28.18% and MARE of 232.57%. 
Therefore, both versions of the M-estimator lead to a 
significant improvement over the GREG estimator in terms 
of RRMSE and MARE at the expense of an increase in RB 
(around -10%). The MARE is smaller for the weighted 
version, which again indicates that it controls influential 
units better than the unweighted version. However, the 
difference in the RRMSE between these two estimators is 
very small. Curiously, it seems that there is no increase in 
MSE due to estimating $\varphi$ instead of using the optimal value 
when the unweighted version is used. This observation is 
somewhat difficult to explain. 

Figure 6.3. RB, RRMSE and MARE of the M-estimator when 
$h_k = 1$ 

Figure 6.4. RB, RRMSE and MARE of the M-estimator when 
$h_k = w_k$ 


6.4 Comparison of the Singly-iterated and Fully-iterated M-estimators 


We now compare the singly- and fully-iterated 
M-estimators when $\alpha = 1$. We only consider the following 
two cases: i) $h_k = 1$ and $\varphi = 11$; and ii) $h_k = w_k$ and 
$\varphi = 100$. Most of the time, the IRLS algorithm converged 
quickly in the fully-iterated case (average number of 
iterations for convergence is 7.53 for $h_k = 1$, and 7.29 for 
$h_k = w_k$), but in some of the 5,000 samples (64 for $h_k = 1$, 
and 75 for $h_k = w_k$) it did not converge. When this situation 
occurred, we kept the M-estimate from the last iteration of 
the IRLS algorithm. From table 6.2, it is evident that the 
RB, RRMSE and MARE of the singly- and fully-iterated 
M-estimators are very close to each other. A point worth 
noting is the slightly smaller RBs for singly-iterated 
M-estimators. This point has also been observed by Lee 
(1991) and is likely due to the fact that we used $\hat{B}^{(0)} = \hat{B}^{LS}$ 
as the vector of starting values for the IRLS algorithm, 
which is ADU for $B$. 

Table 6.2 
Comparison of Singly- and Fully-iterated M-estimators 

                                             Singly-iterated               Fully-iterated 
Estimator                                    RB       RRMSE    MARE       RB            RRMSE    MARE 
M-estimator ($h_k = 1$, $\varphi = 11$)      -6.94%   29.28%   235.07%    [illegible]   29.29%   [illegible] 
M-estimator ($h_k = w_k$, $\varphi = 100$)   -8.14%   25.36%   197.86%    -8.27%        25.33%   196.73% 


7. CONCLUSION 


In this paper, we considered robust alternatives to the 
optimal (BLU) estimator. We first proposed a compromise 
between the GREG and BLU estimators, the LS estimator, 
to deal with deviations from the ignorability assumption. 
The LS estimator is obtained by shrinking the design 
weights toward their mean. It is expected to be more stable 
than the GREG estimator when the ignorability assumption 
holds approximately and less biased than the BLU estimator 
when this assumption is not fully satisfied. This was 
confirmed in a simulation study using a population created 
from real survey data. The LS estimator also offers some 
protection against deviations from model assumptions. 

To deal with outliers, we suggested using the weighted 
generalized M-estimation technique to reduce the influence 
of units with large weighted population residuals. We found 
in a simulation study that significant gains in MSE could be 
obtained with this method. We also found that an 
M-estimator obtained using a single iteration of the IRLS 
algorithm performed similarly to a fully-iterated M-estimator. 
Finally, we proposed implementing M-estimators for 
multipurpose surveys by modifying either the weights of 
influential units or their values. We believe that both approaches 
are useful and contribute to bridge a small gap between 
theory and practice. 


APPENDIX 


In this proof, we remove the conditioning on X when 
taking expectations and variances with respect to model m 
in order to simplify the notation. Using Slutsky’s theorem, 
to show that $E_p(\hat{t}_y^{LS} - t_y)/t_y$ converges in probability to 0, 
as the sample size n and the population size N tend to 
infinity, under assumptions (A1), (A2) and (A3), it suffices 
to show that: 


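a) $E_p(t_y/t_x'\beta) = t_y/t_x'\beta$ converges in probability to 
1 and 

b) $E_p(\hat{t}_y^{LS}/t_x'\beta)$ converges in probability to 1. 

To show (a), note that 

$$E_m\left(\frac{t_y}{t_x'\beta}\right) = 1$$

and 

$$V_m\left(\frac{t_y}{t_x'\beta}\right) = \frac{1}{N}\,\frac{1}{(t_x'\beta/N)^2}\,\sum_{k \in U} \sigma_k^2/N.$$

By Chebychev’s inequality, $t_y/t_x'\beta$ converges in 
probability to 1 under model m, as N increases, if 
$t_x'\beta = O(N)$ and $\sum_{k \in U} \sigma_k^2 = O(N)$ (assumption A3). 

To show (b), we first note that $E_m E_p(\cdot) = E_p E_m(\cdot \mid s)$ 
provided that the set of all possible samples does not depend 
on which population was generated by model m. 
Consequently, if assumption (A1) holds, it is 
straightforward to show that $E_m E_p(\hat{t}_y^{LS}/t_x'\beta) = 1$. Then, we 
note that 

$$V_{mp}\left(\frac{\hat{t}_y^{LS}}{t_x'\beta}\right) = V_m E_p\left(\frac{\hat{t}_y^{LS}}{t_x'\beta}\right) + E_m V_p\left(\frac{\hat{t}_y^{LS}}{t_x'\beta}\right). \qquad (A.1)$$

As a result, $V_m E_p(\hat{t}_y^{LS}/t_x'\beta) \le V_{mp}(\hat{t}_y^{LS}/t_x'\beta)$ since the 
two terms on the right side of (A.1) are greater than or equal 
to 0. By the previous inequality and Chebychev’s inequality, 
$E_p(\hat{t}_y^{LS}/t_x'\beta)$ converges in probability to 1 under model m, 
as n and N increase, if $\lim_{n,N \to \infty} V_{mp}(\hat{t}_y^{LS}/t_x'\beta) = 0$. Using 
assumption (A2), it is straightforward to show that 

$$V_{mp}\left(\frac{\hat{t}_y^{LS}}{t_x'\beta}\right) = \frac{1}{N}\,\frac{1}{(t_x'\beta/N)^2}\,\sum_{k \in U} E_p\{(w_k^{LS})^2 I_k\}\,\sigma_k^2/N.$$

Consequently, $\lim_{n,N \to \infty} V_{mp}(\hat{t}_y^{LS}/t_x'\beta) = 0$ if $t_x'\beta = 
O(N)$ and $\sum_{k \in U} E_p\{(w_k^{LS})^2 I_k\}\,\sigma_k^2 = O(N)$ (assumption 
A3). This completes the proof. 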



REFERENCES 


BEATON, A.E., and TUKEY, J.W. (1974). The fitting of power 
series, meaning polynomials, illustrated on band-spectroscopic 
data. Technometrics, 16, 147-185. 


BASU, D. (1971). An essay on the logical foundations of survey 
sampling, part 1. In Foundations of statistical inference, (Eds. 
V.P. Godambe and D.A. Sprott), Toronto: Holt, Rinehart, and 
Winston, 203-233. 


BINDER, D.A. (1983). On the variances of asymptotically normal 
estimators from complex surveys. International Statistical Review, 
51, 279-292. 


CHAMBERS, R.L. (1986). Outlier robust finite population 
estimation. Journal of the American Statistical Association, 81, 
1063-1069. 


CHAMBERS, R.L., DORFMAN, A.H. and WEHRLY, T.E. (1993). 
Bias robust estimation in finite populations using nonparametric 
calibration. Journal of the American Statistical Association, 88, 
268-277. 


DEVILLE, J.-C., and SÄRNDAL, C.-E. (1992). Calibration 
estimators in survey sampling. Journal of the American Statistical 
Association, 87, 376-382. 


DRAPER, N., and SMITH, H. (1980). Applied regression analysis, 
second edition. New-York, John Wiley & Sons, Inc. 


DUCHESNE, P. (1999). Robust calibration estimators. Survey 
Methodology, 25, 43-56. 


DUMOUCHEL, W.H., and DUNCAN, G.J. (1983). Using sample 
survey weights in multiple regression analyses of stratified 
samples. Journal of the American Statistical Association, 78, 535- 
543. 


ELLIOTT, M.R., and LITTLE, R.J.A. (2000). Model-based 
alternatives to trimming survey weights. Journal of Official 
Statistics, 16, 191-209. 


GRAUBARD, B.I., and KORN, E.L. (1993). Hypothesis testing with 
complex survey data: the use of classical quadratic test statistics 
with particular reference to regression problems. Journal of the 
American Statistical Association, 88, 629-641. 


GWET, J.-P., and LEE, H. (2000). An evaluation of outlier-resistant 
procedures in establishment surveys. In The Second International 
Conference on Establishment Surveys, American Statistical 
Association, Alexandria, Virginia, 707-716. 


GWET, J.-P., and RIVEST, L.-P. (1992). Outlier resistant alternatives 
to the ratio estimator. Journal of the American Statistical 
Association, 87, 1174-1182. 


HAMPEL, F.R., RONCHETTI, E.M., ROUSSEEUW, P.J. and 
STAHEL, W.A. (1986). Robust Statistics: the Approach Based on 
Influence Functions. New-York, John Wiley & Sons, Inc. 


HEDLIN, D., FALVEY, H., CHAMBERS, R. and KOKIC, P. 
(2001). Does the model matter for GREG estimation? A business 
survey example. Journal of Official Statistics, 17, 527-544. 


HUBER, P.J. (1964). Robust estimation of a location parameter. 
Annals of Mathematical Statistics, 35, 73-101. 


HUBER, P.J. (1981). Robust Statistics. New-York, John Wiley & 
Sons, Inc. 


HULLIGER, B. (1995). Outlier robust Horvitz-Thompson estimators. 
Survey Methodology, 21, 79-87. 




HULLIGER, B. (1999). Simple and robust estimators for sampling. 
In Proceedings of the Section on Survey Research Methods, 
American Statistical Association, 54-63. 


KALTON, G., and FLORES-CERVANTES, I. (2003). Weighting 
methods. Journal of Official Statistics, 19, 81-97. 


KISH, L. (1992). Weighting for unequal $P_i$. Journal of Official 
Statistics, 8, 183-200. 


KORN, E.L., and GRAUBARD, B.I. (1999). Analysis of Health 
Surveys. New-York, John Wiley & Sons, Inc. 


LEE, H. (1991). Model-based estimators that are robust to outliers. In 
Proceedings of the Annual Research Conference, Washington, 
DC, U.S. Bureau of the Census, 178-202. 


LEE, H. (1995). Outliers in business surveys. In Business Survey 
Methods, (Eds. B.G. Cox, D.A. Binder, B.N. Chinnappa, A. 
Christianson, M.J. Colledge and P.S. Kott). Chapter 26, New- 
York, John Wiley & Sons, Inc. 


LITTLE, R.J.A. (1983). Estimating a finite population mean from 
unequal probability sampling. Journal of the American Statistical 
Association, 78, 596-604. 


PFEFFERMANN, D. (1993). The role of sampling weights when 
modeling survey data. International Statistical Review, 61, 317-337. 


POTTER, F. (1988). Survey of procedures to control extreme 
sampling weights. In Proceedings of the Section on Survey 
Research Methods, American Statistical Association, 453-458. 


POTTER, F. (1990). A study of procedures to identify and trim 
extreme sampling weights. In Proceedings of the Section on 
Survey Research Methods, American Statistical Association, 225- 
230. 


POTTER, F. (1993). The effect of weight trimming on nonlinear 
survey estimates. In Proceedings of the Section on Survey 
Research Methods, American Statistical Association, 758-763. 


RAO, J.N.K. (1966). Alternative estimators in PPS sampling for 
multiple characteristics. Sankhya, Series A, 28, 47-60. 


RAO, J.N.K., WU, C.F.J. and YUE, K. (1992). Some recent work on 
resampling methods for complex surveys. Survey Methodology, 
18, 209-217. 


REN, R., and CHAMBERS, R.L. (2002). Outlier robust imputation of 
survey data via reverse calibration. Southampton Statistical 
Sciences Research Institute Methodology Working Paper M03/19, 
University of Southampton. 


ROYALL, R.M. (1976). The linear least-squares prediction approach 
to two-stage sampling. Journal of the American Statistical 
Association, 71, 657-664. 


ROYALL, R.M., and HERSON, J. (1973). Robust estimation in finite 
populations I. Journal of the American Statistical Association, 68, 
880-889. 


RUBIN, D.B. (1976). Inference and missing data. Biometrika, 63, 
581-592. 


SÄRNDAL, C.-E., SWENSSON, B. and WRETMAN, J.H. (1992). 
Model Assisted Survey Sampling. New-York, Springer-Verlag. 


STOKES, L. (1990). A comparison of truncation and shrinking of 
sampling weights. In Proceedings of the 1990 Annual Research 
Conference, Washington, DC: Bureau of the Census, 463-471. 


SUGDEN, R.A., and SMITH, T.M.F. (1984). Ignorable and 
informative designs in survey sampling inference. Biometrika, 71, 
495-506. 




VALLIANT, R., DORFMAN, A. and ROYALL, R.M. (2000). Finite 
population sampling: a prediction approach. New-York, John 
Wiley & Sons, Inc. 


WELSH, A.H., and RONCHETTI, E. (1998). Bias-calibrated 
estimation from sample surveys containing outliers. Journal of the 
Royal Statistical Society, Series B, 60, 413-428. 


ZASLAVSKY, A.M., SCHENKER, N. and BELIN, T.R. (2001). 
Downweighting influential clusters in surveys: application to the 
1990 post enumeration survey. Journal of the American Statistical 
Association, 96, 858-869. 


Survey Methodology, December 2004 
Vol. 30, No. 2, pp. 209-218 
Statistics Canada 




Penalized Spline Nonparametric Mixed Models for Inference about a Finite Population Mean from Two-Stage Samples 


HUI ZHENG and RODERICK J.A. LITTLE 


ABSTRACT 


Samplers often distrust model-based approaches to survey inference because of concerns about misspecification when 
models are applied to large samples from complex populations. We suggest that the model-based paradigm can work very 
successfully in survey settings, provided models are chosen that take into account the sample design and avoid strong 
parametric assumptions. The Horvitz-Thompson (HT) estimator is a simple design-unbiased estimator of the finite 
population total. From a modeling perspective, the HT estimator performs well when the ratios of the outcome values and 
the inclusion probabilities are exchangeable. When this assumption is not met, the HT estimator can be very inefficient. In 
Zheng and Little (2003, 2004) we used penalized splines (p-splines) to model smoothly-varying relationships between the 
outcome and the inclusion probabilities in one-stage probability proportional to size (PPS) samples. We showed that 
p-spline model-based estimators are in general more efficient than the HT estimator, and can provide narrower confidence 
intervals with close to nominal confidence coverage. In this article, we extend this approach to two-stage sampling designs. 
We use a p-spline based mixed model that fits a nonparametric relationship between the primary sampling unit (PSU) means 
and a measure of PSU size, and incorporates random effects to model clustering. For variance estimation we consider the 
empirical Bayes model-based variance, the jackknife and balanced repeated replication (BRR) methods. Simulation studies 
on simulated data and samples drawn from public use microdata in the 1990 census demonstrate gains for the model-based 
p-spline estimator over the HT estimator and linear model-assisted estimators. Simulations also show the variance 
estimation methods yield confidence intervals with satisfactory confidence coverage. Interestingly, these gains can be seen 
for a common equal-probability design, where the first stage selection is PPS and the second stage selection probabilities are 
proportional to the inverse of the first stage inclusion probabilities, and the HT estimator leads to the unweighted mean. In 
situations that most favor the HT estimator, the model-based estimators have comparable efficiency. 


KEY WORDS: Weighting; REML; Empirical Bayes estimation. 


1. INTRODUCTION 


In a sample survey, let $y_i$ denote the value of an 
outcome $Y$ for unit $i$, and let $S$ denote the set of sampled 
units. The Horvitz-Thompson (HT) estimator (Horvitz and 
Thompson 1952) $\hat{Y}_{HT} = \sum_{i \in S} y_i/\pi_i$, where $\pi_i$ is the 
probability of selection of unit $i$, is a design-unbiased 
estimator of the finite population total (and of the mean 
when divided by the known population count $N$). It can also 
be regarded as a model-based projective estimator (Firth and 
Bennett 1998) for the following linear model relating $y_i$ to 
$\pi_i$: 

$$y_i = \beta \pi_i + \pi_i \varepsilon_i,$$

where $\varepsilon_i$ is assumed to be i.i.d. normally distributed with 
mean zero and variance $\sigma^2$. 
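
As a small illustration of the HT estimator just defined, the following sketch computes the estimated total and mean for a toy sample. The function name `horvitz_thompson_total` and all numbers are hypothetical and are not taken from the paper; the toy data are chosen so that the ratios $y_i/\pi_i$ are constant, the situation in which the HT estimator performs well.

```python
import numpy as np

def horvitz_thompson_total(y, pi):
    """HT estimate of a population total: sum of y_i / pi_i over the sample."""
    y = np.asarray(y, dtype=float)
    pi = np.asarray(pi, dtype=float)
    if np.any(pi <= 0) or np.any(pi > 1):
        raise ValueError("inclusion probabilities must lie in (0, 1]")
    return float(np.sum(y / pi))

# Toy sample (hypothetical values) in which y_i / pi_i is the same for every unit.
y = np.array([2.0, 3.0, 6.0])
pi = np.array([0.10, 0.15, 0.30])
N = 100  # assumed known population count

ht_total = horvitz_thompson_total(y, pi)
ht_mean = ht_total / N
```

When the ratios $y_i/\pi_i$ vary strongly across units, the same estimator remains design-unbiased but its variance can be large, which motivates the spline models that follow.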

In Zheng and Little (2003, 2004), we proposed a 
nonparametric model 

$$y_i = f(\pi_i) + \varepsilon_i, \quad \varepsilon_i \overset{ind}{\sim} N(0, \pi_i^{2k}\sigma^2),$$

using penalized splines to model the mean of the outcome $y_i$ as a 
smoothly-varying function $f$ of the inclusion probabilities 
$\pi_i$. We showed in Zheng and Little (2003) that the 
nonparametric model-based estimators are more efficient 
than HT for general one-stage probability-proportional-to-size 
(PPS) samples and not much less efficient than HT 
when the data are generated using a model that favors HT. 
In this article we consider two-stage sampling. In the first 
stage, a subset of $m$ primary sampling units (PSUs) is drawn 
from a population with $H$ PSUs with unequal probabilities 
$\pi_{1h}, h = 1, ..., H$. Let us number the included PSUs from 1 
to $m$. In the second stage, a simple random sample (srs) of 
$n_h$ out of $N_h$ secondary sampling units (SSUs) is drawn 
from the sampled PSU labeled $h$ with probability $\pi_{2h}$. The 
overall selection probability for unit $i$ in PSU $h$ is 
$\pi_{hi} = \pi_{1h}\pi_{2h}$, and the HT estimator of the mean of an 
outcome $Y$ is 

$$\bar{y}_{HT} = \frac{1}{N}\sum_{h=1}^{m}\sum_{i=1}^{n_h} \frac{y_{hi}}{\pi_{1h}\pi_{2h}},$$

where $y_{hi}$ is the value of $Y$ for unit $i$ in PSU $h$ and $N$ is the known total 
number of units (SSUs) in the whole population. In a 
commonly adopted design, the first stage selection 
probability is proportional to an estimate of the PSU size, 
and the second stage inclusion probabilities are proportional 
to the inverse of the first stage inclusion probabilities so that 
the overall inclusion probabilities $\pi_{hi}$ are equal for all SSUs. 
The inverse probability weighted mean in this case equals 
the simple sample mean $\bar{y} = \sum_{h=1}^{m}\sum_{i=1}^{n_h} y_{hi} / \sum_{h=1}^{m} n_h$. 
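
The equivalence between the HT mean and the simple sample mean under this self-weighting design can be checked numerically. The sketch below uses hypothetical PSU sizes and outcomes; it assumes first-stage probabilities exactly proportional to the PSU counts and a fixed take of 10 SSUs per sampled PSU.

```python
import numpy as np

rng = np.random.default_rng(0)

N_h = np.array([50, 120, 200])   # SSU counts of the sampled PSUs (hypothetical)
N = 10_000                       # assumed known population count
m, n_per_psu = len(N_h), 10

pi_1 = m * N_h / N               # first stage: PPS with size measure N_h
pi_2 = n_per_psu / N_h           # second stage: srs of 10 SSUs, inverse to pi_1

# Arbitrary outcome values for the sampled SSUs in each PSU.
y = [rng.normal(size=n_per_psu) for _ in N_h]

ht_mean = sum(np.sum(y_h) / (p1 * p2) for y_h, p1, p2 in zip(y, pi_1, pi_2)) / N
simple_mean = np.mean(np.concatenate(y))
```

Here every SSU has overall inclusion probability $m \times 10 / N$, so the two quantities agree up to rounding error.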

We assume throughout this article that the selection 
probabilities $\pi_{1h}$ are known for all the PSUs $h = 1, ..., H$. 
In sections 2 and 3, we assume the PSU counts $N_h$ are also 
known for all the PSUs in the population. In section 4, we 
discuss the common situation where $N_h$ is only known for 
sampled PSUs, but the $N_h$ for nonsampled PSUs can be 
estimated using a regression model based on auxiliary 
variables known for all PSUs in the population. 

Särndal, Swensson and Wretman (1992) discussed 
model-assisted alternatives to the HT estimator for two-stage 
samples with auxiliary information available at the 
PSU or SSU level. In the first case, let $x_h$ denote a vector of 
PSU-level auxiliary variables for PSU $h$. The PSU totals 
$t_h = \sum_{i=1}^{N_h} y_{hi}$ are assumed to be related to $x_h$ according to a 
linear model: 

$$E(t_h \mid x_h) = x_h^T B, \quad Var(t_h) = \sigma_h^2, \quad h = 1, ..., H$$

(Särndal et al. 1992). $B$ is estimated by the probability-weighted 
regression 

$$\hat{B} = \left(\sum_{h=1}^{m} \frac{x_h x_h^T}{\sigma_h^2 \pi_{1h}}\right)^{-1} \sum_{h=1}^{m} \frac{x_h \hat{t}_h}{\sigma_h^2 \pi_{1h}},$$

where $\hat{t}_h = \sum_{i=1}^{n_h} y_{hi}/\pi_{2h}$, leading to the projected totals 
$\tilde{t}_h = x_h^T \hat{B}, h = 1, ..., H$. In practice, estimates $\hat{\sigma}_h^2$, either 
simply assumed (e.g., $\hat{\sigma}_h^2$ proportional to a measure of size 
of stratum $h$) or estimated, replace $\sigma_h^2$ in the above formula. 
The generalized regression (GR) estimator of the grand total 
is 

$$\hat{T}_{GR} = \sum_{h=1}^{H} \tilde{t}_h + \sum_{h=1}^{m} \frac{\hat{t}_h - \tilde{t}_h}{\pi_{1h}},$$

and the estimate for the mean is $\hat{T}_{GR}/N$. The term 
$\sum_{h=1}^{m} (\hat{t}_h - \tilde{t}_h)/\pi_{1h}$ is a bias calibration term that makes the 
estimator design-consistent. 
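
A minimal sketch of the PSU-level GR estimator described above is given here, assuming the working variances are supplied by the user. The function name `gr_total_psu` and the toy inputs are hypothetical; the check uses PSU totals that are exactly linear in the auxiliary vector, in which case the calibration term is zero.

```python
import numpy as np

def gr_total_psu(x_all, x_s, t_hat, pi1, sigma2):
    """PSU-level GR estimate of the population total.

    x_all : (H, q) auxiliary vectors for all PSUs
    x_s   : (m, q) auxiliary vectors for the sampled PSUs
    t_hat : (m,) estimated totals of the sampled PSUs
    pi1   : (m,) first-stage inclusion probabilities of the sampled PSUs
    sigma2: (m,) working variances of the PSU totals
    """
    w = 1.0 / (np.asarray(sigma2, float) * np.asarray(pi1, float))
    A = (x_s * w[:, None]).T @ x_s           # sum of x x' / (sigma^2 pi_1)
    b = (x_s * w[:, None]).T @ t_hat         # sum of x t_hat / (sigma^2 pi_1)
    B_hat = np.linalg.solve(A, b)
    projected_all = x_all @ B_hat            # projected totals for all H PSUs
    calibration = np.sum((t_hat - x_s @ B_hat) / pi1)
    return float(projected_all.sum() + calibration)

# Toy check (hypothetical numbers): totals exactly linear in x_h.
x_all = np.column_stack([np.ones(6), np.array([10.0, 20.0, 30.0, 40.0, 50.0, 60.0])])
B_true = np.array([2.0, 3.0])
sampled = np.array([0, 2, 5])
gr = gr_total_psu(x_all, x_all[sampled], x_all[sampled] @ B_true,
                  pi1=np.array([0.2, 0.5, 0.8]), sigma2=np.ones(3))
```

With these inputs the result equals the sum of the true linear totals over all six PSUs, whatever the first-stage probabilities are.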

In the second case where auxiliary information is known 
at the SSU level, let $x_{hi}$ denote the set of auxiliary variables 
for SSU $i$ in PSU $h$, $i = 1, ..., N_h$, $h = 1, ..., H$. The 
relationship between the outcome and the auxiliary 
information is modeled by 

$$E(y_{hi} \mid x_{hi}) = x_{hi}^T B, \quad Var(y_{hi}) = \sigma_{hi}^2, \quad i = 1, ..., N_h, \; h = 1, ..., H.$$

The probability weighted regression estimate for $B$ is 

$$\hat{B} = \left(\sum_{h=1}^{m}\sum_{i=1}^{n_h} \frac{x_{hi} x_{hi}^T}{\sigma_{hi}^2 \pi_{hi}}\right)^{-1} \sum_{h=1}^{m}\sum_{i=1}^{n_h} \frac{x_{hi} y_{hi}}{\sigma_{hi}^2 \pi_{hi}},$$

where $\pi_{hi}$ is the probability for unit $(h, i)$ to be included in 
the sample. The GR estimator for the grand total is 

$$\hat{T}_{GR} = \sum_{h=1}^{H}\sum_{i=1}^{N_h} \hat{y}_{hi} + \sum_{h=1}^{m}\sum_{i=1}^{n_h} \frac{y_{hi} - \hat{y}_{hi}}{\pi_{hi}},$$

where $\hat{y}_{hi} = x_{hi}^T \hat{B}$. The estimator for the mean is $\hat{T}_{GR}/N$. 

These two methods do not account for the within-PSU 
correlations of outcome. These correlations can be modeled 
by treating PSU means as random effects in a hierarchical 
model. For the case where PSU-level information $x_h$ is 
available for all PSUs, one such model is: 

$$y_{hi} \mid \mu_h \overset{ind}{\sim} N(\mu_h, \sigma^2)$$
$$\mu \sim N_H(\varphi, D) \qquad (1)$$

where $\mu = (\mu_1, ..., \mu_H)^T$, $\varphi = (\varphi_1, ..., \varphi_H)^T$, $\mu_h$ is 
the mean outcome in PSU $h$, $\varphi_h = x_h\beta$, and $D$ is the 
covariance matrix of the PSU means. The model-based 
estimator of $\bar{Y}$ is given by 

$$E(\bar{Y} \mid y, x_h) = \frac{1}{N}\left(\sum_{h=1}^{m}\left[n_h \bar{y}_h + (N_h - n_h)\hat{\mu}_h\right] + \sum_{h=m+1}^{H} N_h \hat{\mu}_h\right),$$

where $\hat{\mu}_h = E(\mu_h \mid y, x_h)$ and $y$ is the vector of outcomes 
in the sample. 

Different assumptions about $\varphi$ and $D$ in (1) lead to the 
following models: 


Exchangeable random effects (XRE): (Holt and Smith 
1979; Ghosh and Meeden 1986; Little 1991; Lazzaroni and 
Little 1998): $\varphi_h = \mu$, $h = 1, ..., H$ and $D = \tau^2 I_H$; 

Autoregressive (AR1): (Lazzaroni and Little 1998): 
$\varphi_h = \mu$, $h = 1, ..., H$ and $D = \tau^2\{\rho^{|i-j|}\}$; 

Linear (LIN): (Lazzaroni and Little 1998): $\varphi_h = 
\alpha + \beta x_h$, $h = 1, ..., H$ and $D = \tau^2 I_H$; 

Nonparametric: (Elliott and Little 2000): $\varphi_h = 
f(x_h)$, $h = 1, ..., H$, and $D = 0$. 


The nonparametric models in Elliott and Little (2000) 
assume a nonparametric mean function relating the outcome 
to the design variables. By assuming $D = 0$, the PSU 
means are modeled to equal the mean function $f$ instead of 
varying around it. Nonparametric mixed models relax the 
assumptions on $D$ (e.g., $D = \tau^2 I_H$) and serve as a natural 
extension of the Elliott and Little (2000) model and linear 
mixed models with a parametric mean structure. 
It is worth pointing out that some estimators in the above 
family of models correspond to standard design-based 
estimators. For example, in an equal-probability design 
where the $n_h$ are approximately constant across PSUs, the 
unweighted mean corresponds to the special model-based 
estimator that assumes $\varphi_h$ is constant. 


2. ESTIMATION FOR THE P-SPLINE MIXED MODEL 


The linear structure of $\varphi$ in the LIN model is subject to 
misspecification when the actual mean structure is non- 
linear. The non-linearity problem can be partially solved by 
adding polynomial terms (e.g., quadratic or cubic terms) to 
the fixed effects part in the LIN model. P-spline 
nonparametric mixed models (Lin and Zhang 1999; 
Brumback, Ruppert and Wand 1999; Coull, Schwartz and 
Wand 2001) are even more flexible, since they replace 
polynomials by smooth nonparametric functions. We 
propose the following p-spline nonparametric mixed model 
for inference about the population mean: 


P-spline nonparametric mixed model (PMM): 
$\varphi_h = f(x_h)$, $h = 1, ..., H$, $D = \tau^2 I_H$, 


where $f$ is a nonparametric degree $p$ spline function: 

$$f(x) = \beta_0 + \sum_{j=1}^{p} \beta_j x^j + \sum_{l=1}^{K} \beta_{p+l}(x - \kappa_l)_+^p,$$

where $\kappa_1 < ... < \kappa_K$ are $K$ fixed knots, $\beta_0, ..., \beta_{p+K}$ are 
coefficients to be estimated and $(x)_+^p = x^p I(x \ge 0)$. 

A naive way of estimating $\beta_0, ..., \beta_{p+K}$ is to treat them 
as fixed and estimate them together with the variance 
components $\sigma^2$ and $\tau^2$ by fitting a linear mixed model. 
However, this method can yield estimates of $f$ with too 
much roughness and variability. To avoid overfitting, the 
roughness of the estimate of $f$ can be penalized by adding a 
penalty term to the sum of squared deviations, so that the 
solution $\hat{\beta}_0, ..., \hat{\beta}_{p+K}$ minimizes 

$$\sum_{h=1}^{m}\sum_{i=1}^{n_h}\left(y_{hi} - f(x_h)\right)^2 + \alpha \sum_{l=1}^{K} \beta_{p+l}^2.$$

This is achieved in the context of the model by assigning 
$\beta_0, ..., \beta_p$ flat priors, $(\beta_{p+1}, ..., \beta_{p+K})$ a normal prior 
$N_K(0, \sigma_\beta^2 I_K)$, and letting $\alpha = \sigma^2/\sigma_\beta^2$. The result is a 
penalized spline (p-spline) model. 
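
The penalized criterion can be illustrated for the piecewise linear case ($p = 1$). The sketch below is a simplification, not the authors' implementation: it fixes the smoothing parameter `alpha` by hand instead of estimating the variance components by REML, and it omits the PSU random effects. The function names are hypothetical.

```python
import numpy as np

def truncated_linear_basis(x, knots):
    """Columns: 1, x, (x - k_1)_+, ..., (x - k_K)_+."""
    x = np.asarray(x, dtype=float)
    knots = np.asarray(knots, dtype=float)
    fixed = np.column_stack([np.ones_like(x), x])
    spline = np.maximum(x[:, None] - knots[None, :], 0.0)
    return np.hstack([fixed, spline])

def fit_pspline(x, y, knots, alpha):
    """Minimize the sum of squared deviations plus alpha * sum of squared
    truncated-line coefficients; intercept and slope are not penalized."""
    C = truncated_linear_basis(x, knots)
    penalty = np.diag([0.0, 0.0] + [alpha] * len(knots))
    return np.linalg.solve(C.T @ C + penalty, C.T @ np.asarray(y, dtype=float))

def predict_pspline(x_new, knots, coef):
    return truncated_linear_basis(x_new, knots) @ coef

# Toy data (hypothetical): an exactly linear mean function.
x = np.linspace(0.0, 1.0, 30)
knots = np.quantile(x, [0.25, 0.5, 0.75])
y = 1.0 + 2.0 * x
coef = fit_pspline(x, y, knots, alpha=1.0)
```

For exactly linear data the penalized fit recovers the intercept and slope and shrinks all truncated-line coefficients to zero, which is the sense in which the penalty favors smooth fits. A larger `alpha` pulls the fit toward a straight line and a smaller one lets it follow the data more closely.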

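When $p = 1$, $f$ is piecewise linear and the coefficients 
$\beta_0, ..., \beta_{K+1}$ and $\sigma^2$, $\sigma_\beta^2$ and $\tau^2$ are estimated by fitting the 
linear mixed model: 

$$y = X_1\beta + X_2 u + \varepsilon, \qquad (2)$$

where $y = (y_{11}, ..., y_{1n_1}, ..., y_{m1}, ..., y_{mn_m})^T$, $\beta = (\beta_0, \beta_1)^T$, 
$\varepsilon = (\varepsilon_{11}, ..., \varepsilon_{1n_1}, ..., \varepsilon_{m1}, ..., \varepsilon_{mn_m})^T$, 

$$X_1 = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_1 \\ \vdots & \vdots \\ 1 & x_m \\ \vdots & \vdots \\ 1 & x_m \end{bmatrix}, \quad
X_2 = \begin{bmatrix}
(x_1 - \kappa_1)_+ & \cdots & (x_1 - \kappa_K)_+ & 1 & \cdots & 0 \\
\vdots & & \vdots & \vdots & & \vdots \\
(x_1 - \kappa_1)_+ & \cdots & (x_1 - \kappa_K)_+ & 1 & \cdots & 0 \\
\vdots & & \vdots & \vdots & & \vdots \\
(x_m - \kappa_1)_+ & \cdots & (x_m - \kappa_K)_+ & 0 & \cdots & 1 \\
\vdots & & \vdots & \vdots & & \vdots \\
(x_m - \kappa_1)_+ & \cdots & (x_m - \kappa_K)_+ & 0 & \cdots & 1
\end{bmatrix},$$

where the rows for PSU $h$, namely $(1, x_h)$ in $X_1$ and 
$((x_h - \kappa_1)_+, ..., (x_h - \kappa_K)_+)$ followed by the indicator of PSU $h$ in $X_2$, are both repeated 
$n_h$ times. The random terms $u$ and $\varepsilon$ are mutually 
independent with 

$$u = (\beta_2, ..., \beta_{K+1}, u_1, ..., u_m)^T \sim N_{K+m}(0, G), \quad
G = \begin{bmatrix} \sigma_\beta^2 I_K & 0 \\ 0 & \tau^2 I_m \end{bmatrix}.$$

Variance components $\sigma^2$, $\sigma_\beta^2$ and $\tau^2$ can be estimated by 
fitting model (2) by restricted maximum likelihood 
(REML). 

The predicted means of PSUs included in the sample 
are given by: $\hat{\mu} = X_1\hat{\beta} + X_2\hat{u}$, where $\hat{\beta} = (X_1^T V^{-1} X_1)^{-1} X_1^T V^{-1}\bar{y}$, 
$\hat{u} = G X_2^T V^{-1}(\bar{y} - X_1\hat{\beta})$, $V = X_2 G X_2^T + \sigma^2\,\mathrm{diag}(1/n_h)$ 
and $\bar{y} = (\bar{y}_1, ..., \bar{y}_m)^T$. Because these expressions involve the vector of PSU means, 
$X_1$ and $X_2$ are taken here with one row per sampled PSU. The predicted mean for a PSU $h$ that is not 
selected in the first stage is $\hat{\mu}_h = x_h^T \hat{\beta}_F$, where 

$$x_h = (1, x_h, (x_h - \kappa_1)_+, ..., (x_h - \kappa_K)_+)^T$$

and 

$$\hat{\beta}_F = (\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, ..., \hat{\beta}_{K+1})^T.$$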


Combining the predictions, we obtain the model-based 
estimator of the population mean 

$$E(\bar{Y} \mid y, x_h) = \frac{1}{N}\left(\sum_{h=1}^{m}\left[n_h \bar{y}_h + (N_h - n_h)\hat{\mu}_h\right] + \sum_{h=m+1}^{H} N_h \hat{\mu}_h\right).$$
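
The way the three pieces of this estimator combine (observed sample totals, predictions for unsampled SSUs in sampled PSUs, and predictions for nonsampled PSUs) can be sketched as follows. The function name `model_based_mean` and the toy numbers are hypothetical, and the predicted PSU means are taken as given rather than fitted.

```python
import numpy as np

def model_based_mean(n_s, ybar_s, N_s, mu_s, N_ns, mu_ns):
    """Model-based estimate of the population mean.

    n_s, ybar_s, N_s, mu_s : sample size, sample mean, SSU count and predicted
                             mean for each sampled PSU
    N_ns, mu_ns            : SSU count and predicted mean for each nonsampled PSU
    """
    n_s, ybar_s, N_s, mu_s, N_ns, mu_ns = (
        np.asarray(a, dtype=float) for a in (n_s, ybar_s, N_s, mu_s, N_ns, mu_ns)
    )
    observed = np.sum(n_s * ybar_s)                 # total of sampled SSUs
    unobserved_in_sample = np.sum((N_s - n_s) * mu_s)
    nonsampled = np.sum(N_ns * mu_ns)
    N = N_s.sum() + N_ns.sum()
    return float((observed + unobserved_in_sample + nonsampled) / N)

# Toy example (hypothetical values): two sampled PSUs and one nonsampled PSU.
est = model_based_mean(n_s=[2, 2], ybar_s=[1.0, 3.0], N_s=[4, 6], mu_s=[1.0, 3.0],
                       N_ns=[10], mu_ns=[2.0])
```

In this example the observed total is 8, the predicted remainder in sampled PSUs is 14, the nonsampled PSU contributes 20, and the population has 20 SSUs, so the estimate is 42/20.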


3. VARIANCE ESTIMATION METHODS 
3.1 Empirical Bayes Model-based Variance 


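Model (2) can be interpreted as a Bayes model in which 
the parameters $u = (\beta_2, ..., \beta_{K+1}, u_1, ..., u_m)^T$ have multivariate 
normal prior $N_{K+m}(0, G)$, and $\beta_0$, $\beta_1$, $\sigma^2$, 
$\sigma_\beta^2$ and $\tau^2$ all have flat priors. This leads 
to the Bayes posterior variance for the vector 
$(\beta_0, \beta_1, ..., \beta_{K+1}, u_1, ..., u_m)^T$ conditional on $\sigma^2$, $\sigma_\beta^2$ 
and $\tau^2$ as 

$$Var(\beta_0, \beta_1, ..., \beta_{K+1}, u_1, ..., u_m \mid y, \sigma^2, \sigma_\beta^2, \tau^2) = \sigma^2 (X^T X + \Lambda)^{-1},$$

where $X = [X_1 \; X_2]$ and 

$$\Lambda = \begin{bmatrix}
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & (\sigma^2/\sigma_\beta^2) I_K & 0 \\
0 & 0 & 0 & (\sigma^2/\tau^2) I_m
\end{bmatrix},$$

where $I_K$ and $I_m$ are $(K \times K)$ and $(m \times m)$ identity 
matrices, respectively. 

The empirical Bayes posterior variance for 
$(\beta_0, \beta_1, ..., \beta_{K+1}, u_1, ..., u_m)^T$ is calculated by replacing 
$\sigma^2$, $\sigma_\beta^2$ and $\tau^2$ with their maximum likelihood (ML) or 
restricted maximum likelihood (REML) estimates 
$\hat{\sigma}^2$, $\hat{\sigma}_\beta^2$ and $\hat{\tau}^2$, respectively. The empirical Bayes 
method underestimates the true posterior variance, but the 
underestimation is not severe for the sample sizes 
encountered in many survey settings. A fully Bayes solution 
is also possible, but is not covered here. 

The predicted population mean is $\hat{\bar{Y}} = \hat{T}/N$, where 
$\hat{T} = T_1 + \hat{T}_2$, $T_1 = \sum_{h=1}^{m} n_h \bar{y}_h$ is the sample total, and $\hat{T}_2$ 
is the estimated total for units not included in the sample, 

$$\hat{T}_2 = \sum_{h=1}^{m} (N_h - n_h)\hat{\mu}_h + \sum_{h=m+1}^{H} N_h \hat{\mu}_h
= N_p^T X_p (\hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_{K+1}, \hat{u}_1, ..., \hat{u}_m)^T, \qquad (3)$$

where 

$$N_p = \left((N_1 - n_1), ..., (N_m - n_m), N_{m+1}, ..., N_H\right)^T$$

and 

$$X_p = \begin{bmatrix}
1 & x_1 & (x_1 - \kappa_1)_+ & \cdots & (x_1 - \kappa_K)_+ & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & \ddots & \vdots \\
1 & x_m & (x_m - \kappa_1)_+ & \cdots & (x_m - \kappa_K)_+ & 0 & \cdots & 1 \\
1 & x_{m+1} & (x_{m+1} - \kappa_1)_+ & \cdots & (x_{m+1} - \kappa_K)_+ & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & & \vdots & \vdots & & \vdots \\
1 & x_H & (x_H - \kappa_1)_+ & \cdots & (x_H - \kappa_K)_+ & 0 & \cdots & 0
\end{bmatrix}.$$

The empirical Bayes posterior variance for $\hat{\bar{Y}} = \hat{T}/N$ is 

$$Var(\hat{\bar{Y}} \mid y, \hat{\sigma}^2, \hat{\sigma}_\beta^2, \hat{\tau}^2) = \frac{\hat{\sigma}^2}{N^2} N_p^T X_p (X^T X + \hat{\Lambda})^{-1} X_p^T N_p,$$

where $\hat{\Lambda}$ is $\Lambda$ evaluated at the estimated variance components. 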

3.2 The Jackknife Method 


A jackknife variance estimator is developed for the PMM 
estimator. The jackknife replicates are constructed by 
dividing the set of PSUs into $G$ equal-sized subgroups and 
computing the $g$-th pseudovalue as $\tilde{Y}_g = G\hat{\bar{Y}} - (G-1)\hat{\bar{Y}}_{(g)}$, 
where $\hat{\bar{Y}}$ is the original PMM estimator and $\hat{\bar{Y}}_{(g)}$ is the 
same estimator calculated from the reduced sample obtained 
by excluding the elements from the PSUs in the $g$-th 
subgroup. 

The jackknife variance estimate of $\hat{\bar{Y}}$ is 

$$v(\hat{\bar{Y}}) = \frac{1}{G(G-1)}\sum_{g=1}^{G}\left(\tilde{Y}_g - \bar{\tilde{Y}}\right)^2,$$

where $\bar{\tilde{Y}} = \sum_{g=1}^{G} \tilde{Y}_g/G$. In order to balance the distribution 
of the selection probabilities across the subgroups, sampled 
units are stratified into $n/G$ strata each of size $G$ with similar 
first stage inclusion probabilities, and the $G$ subgroups are 
constructed by randomly selecting one element from each 
stratum. To save computation, estimates $\hat{\sigma}^2$, $\hat{\sigma}_\beta^2$ and $\hat{\tau}^2$ are 
not recomputed for each replicate. That is, we compute 
pseudovalues of $(\hat{\beta}_0, \hat{\beta}_1, ..., \hat{\beta}_{K+1}, \hat{u}_1, ..., \hat{u}_m)$ based on 
the variance components estimated from the whole sample. 
Miller (1974) and Shao and Wu (1987, 1989) proved 
asymptotic properties of the jackknife estimator and 
jackknife variance estimation in the case of multiple linear 
regression. Zheng and Little (2004) provided a theoretical 
justification for the jackknife method for the p-spline model-based 
estimator in the case of one-stage designs. Numerical 
simulations in section 5 suggest that the jackknife method described 
above also works well for the two-stage design. 
Improved performance might be achieved using the 
weighted jackknife proposed by Hinkley (1977). 
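
The grouped jackknife described above can be sketched generically. In the code below the estimator is any function of the retained PSU labels; a simple mean of PSU-level values stands in for the PMM estimator, which is an assumption made only to keep the example short. The names `balanced_groups` and `grouped_jackknife_variance` are hypothetical.

```python
import numpy as np

def balanced_groups(pi1, G, rng):
    """Sort PSUs by first-stage probability, cut them into strata of size G,
    and place one randomly chosen PSU from each stratum into each subgroup."""
    order = np.argsort(pi1)
    groups = [[] for _ in range(G)]
    for start in range(0, len(order), G):
        stratum = rng.permutation(order[start:start + G])
        for g, h in enumerate(stratum):
            groups[g].append(int(h))
    return groups

def grouped_jackknife_variance(estimator, psu_ids, groups):
    """Delete-a-group jackknife variance based on pseudovalues."""
    G = len(groups)
    full = estimator(list(psu_ids))
    pseudo = []
    for group in groups:
        dropped = set(group)
        kept = [h for h in psu_ids if h not in dropped]
        pseudo.append(G * full - (G - 1) * estimator(kept))
    pseudo = np.asarray(pseudo)
    return float(np.sum((pseudo - pseudo.mean()) ** 2) / (G * (G - 1)))

# Toy run (hypothetical values): 48 sampled PSUs split into G = 8 subgroups.
rng = np.random.default_rng(0)
pi1 = rng.uniform(0.01, 0.2, size=48)
psu_means = rng.normal(size=48)
groups = balanced_groups(pi1, G=8, rng=rng)
v_jack = grouped_jackknife_variance(
    lambda ids: float(np.mean(psu_means[ids])), list(range(48)), groups)
```

With 48 PSUs and $G = 8$, each subgroup receives six PSUs, one from each of the six strata of similar inclusion probability. To mirror the computational shortcut in the text, the `estimator` callable would reuse the variance components estimated from the whole sample.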


3.3 The Balanced Repeated Replication Method 


The BRR method can be applied in stratified designs 
with two units sampled in each stratum. For designs with 
one PSU per stratum, strata are often collapsed (Kalton 
1977) for BRR variance estimation. In our application we 
assume the PSUs are sampled systematically from a 
randomly ordered list. This can be viewed approximately as 
a stratified design with $n$ strata each consisting of PSUs with 
cumulative measures of approximate size $\sum_{i=1}^{H} z_i/n$, where 
$z_i$ are the measures of size for the PSUs. One PSU is 
sampled from each of the $n$ strata. Assuming $n$ is even, the 
design can be approximated by a stratified design with $n/2$ 
strata with measures of size $2\sum_{i=1}^{H} z_i/n$, and two units 
sampled per stratum. Balanced repeated half samples are 
constructed by selecting one PSU from each stratum, with 
the selection scheme based on Hadamard matrices (Plackett 
and Burman 1946). Let $\hat{\bar{Y}}_b$ be the p-spline estimator 
computed from the $b$-th half sample, using the same knots as 
used in the computation using the full sample; the number 
and placement of knots need to allow the spline model to 
be fitted on each half-sample. The BRR estimator is given 
by $v_{BRR}(\hat{\bar{Y}}) = \frac{1}{B}\sum_{b=1}^{B}(\hat{\bar{Y}}_b - \hat{\bar{Y}})^2$. This estimate of the 
variance is subject to some bias, because it treats the design 
as if it were stratified with two PSUs per stratum. 
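
The half-sample construction can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the Hadamard matrix is built by Sylvester's doubling construction, so the number of replicates $B$ is the smallest power of 2 exceeding the number of pseudo-strata, and a simple mean of PSU values stands in for the p-spline estimator. Function names are hypothetical.

```python
import numpy as np

def sylvester_hadamard(order):
    """Hadamard matrix of the given order (a power of 2), Sylvester construction."""
    if order < 1 or order & (order - 1):
        raise ValueError("order must be a power of 2")
    had = np.array([[1]])
    while had.shape[0] < order:
        had = np.block([[had, had], [had, -had]])
    return had

def brr_variance(estimator, strata_pairs):
    """BRR variance for a design with two PSUs per (pseudo-)stratum.

    strata_pairs: list of (psu_a, psu_b) labels, one pair per stratum.
    """
    L = len(strata_pairs)
    B = 1
    while B <= L:          # smallest power of 2 strictly greater than L
        B *= 2
    # Skip the constant first column so that every stratum alternates its PSU.
    signs = sylvester_hadamard(B)[:, 1:L + 1]
    full = estimator([h for pair in strata_pairs for h in pair])
    replicates = []
    for row in signs:
        half = [pair[0] if s > 0 else pair[1] for pair, s in zip(strata_pairs, row)]
        replicates.append(estimator(half))
    replicates = np.asarray(replicates)
    return float(np.mean((replicates - full) ** 2))

# Toy example (hypothetical values): three pseudo-strata of two PSUs each.
values = np.array([1.0, 3.0, 2.0, 6.0, 5.0, 5.0])
v_brr = brr_variance(lambda ids: float(np.mean(values[ids])), [(0, 1), (2, 3), (4, 5)])
```

For a linear estimator such as the mean used here, the orthogonality of the Hadamard columns removes all cross-stratum terms, so the result depends only on the within-pair differences.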


4. WHEN SOME PSU COUNTS ARE NOT KNOWN 


In sections 2 and 3 we assumed that the PSU counts $N_h$ 
are known for sampled and non-sampled PSUs. In this 
section we discuss the situation where $N_h$ is only known 
exactly for the sampled PSUs (labeled 1 through $m$). We 
also assume that values $M_h, h = 1, ..., H$ of an auxiliary 
variable predictive of $N_h$ are known for the whole 
population. For example, the $M_h$ may be PSU counts 
estimated from outside sources such as a census. We 
conduct a regression of $N_h$ on $M_h$ using the sampled 
PSUs and replace the counts $N_h$ in (3) for nonsampled 
PSUs with predictions $\hat{N}_h, h = m+1, ..., H$ from this 
regression. The resulting estimate of the total is 

$$\hat{T} = \sum_{h=1}^{m} n_h \bar{y}_h + \sum_{h=1}^{m} (N_h - n_h)\hat{\mu}_h + \sum_{h=m+1}^{H} \hat{N}_h \hat{\mu}_h.$$

The variance estimate of $\hat{T}$ needs to incorporate the 
additional variability in $\hat{N}_h$. In particular, a model-based 
variance for $\hat{T}$ is 

$$Var(\hat{T} \mid x_h, M_h) = Var\left(E(\hat{T} \mid N_h, x_h, M_h)\right) + E\left(Var(\hat{T} \mid N_h, x_h, M_h)\right),$$

where 

$$E(\hat{T} \mid N_h, x_h, M_h) = \sum_{h=1}^{m} n_h \bar{y}_h + \sum_{h=1}^{m} (N_h - n_h)\hat{\mu}_h + \sum_{h=m+1}^{H} N_h \hat{\mu}_h$$

and 

$$Var(\hat{T} \mid N_h, x_h, M_h) = \sigma^2 N_p^T X_p (X^T X + \Lambda)^{-1} X_p^T N_p,$$

$N_p = \left((N_1 - n_1), ..., (N_m - n_m), N_{m+1}, ..., N_H\right)^T$, and $X_p$ 
and $\Lambda$ are defined as in (3). 

If the models for $\mu_h$ and $N_h$ are both correctly 
specified, the above variance can be estimated according to 
the corresponding models. 


5. SIMULATIONS 


5.1 Simulation Design 


Two simulations are conducted to compare the inverse 
probability weighting method, the model-assisted method 
(Särndal et al. 1992) and the PMM method in the case of 
two-stage samples. 

In our first simulation, artificial populations are generated 
with different mean functions $f(\pi_{1h})$ of the first stage 
inclusion probabilities. Four different mean functions are 
simulated: 1) NULL, a constant function; 2) LINDOWN, a 
linearly decreasing function; 3) EXP, an exponentially 
increasing function; and 4) SINE, a sine function. 

Two combinations of values for variance components are 
simulated: 1) $\sigma = 0.1$ and $\tau = 0.2$; 2) $\sigma = 0.2$ and $\tau = 0.1$. 
Only normal errors around the mean functions are simulated 
while both normal and lognormal within-PSU errors are 
simulated. 

The population consists of 500 PSUs, and in the first 
stage 48 PSUs are sampled systematically with probability 
proportional to size (PPS) from a randomly-ordered list. The 
PSU sizes are uniformly distributed with values ranging 
from 4 to about 400. The SSU count in each PSU is 
generated from a distribution with mean equal to 1.05 times 
the measure of size and log-normal errors with standard 
deviation 30. 
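
The first-stage selection described above (systematic PPS from a randomly ordered list) can be sketched as follows. The function name `systematic_pps` is hypothetical, the size distribution is a stand-in for the one used in the paper, and the code assumes that no unit is so large that its inclusion probability would exceed 1.

```python
import numpy as np

def systematic_pps(sizes, n, rng):
    """Select n units by systematic PPS sampling from a randomly ordered list.

    Returns the selected indices and the inclusion probabilities of all units.
    """
    sizes = np.asarray(sizes, dtype=float)
    pi = n * sizes / sizes.sum()
    if np.any(pi > 1):
        raise ValueError("a unit is too large: n * size / total exceeds 1")
    order = rng.permutation(len(sizes))       # random ordering of the list
    cum = np.cumsum(pi[order])                # cumulated probabilities, ends near n
    points = rng.uniform(0.0, 1.0) + np.arange(n)   # random start, unit step
    picks = np.searchsorted(cum, points, side="right")
    picks = np.minimum(picks, len(sizes) - 1)       # guard against rounding at the end
    return order[picks], pi

# Toy population (hypothetical): 500 PSUs with integer sizes between 4 and 400.
rng = np.random.default_rng(1)
sizes = rng.integers(4, 401, size=500)
selected, pi1 = systematic_pps(sizes, 48, rng)
```

Because every inclusion probability is at most 1 and the selection points are one unit apart, no PSU can be selected twice, and the inclusion probabilities sum to the sample size.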

Two types of second-stage sampling plans are studied: 1) 
within-PSU simple random sampling (srs) with inclusion 
probabilities proportional to the inverse of the first stage 
inclusion probabilities, resulting in an equal inclusion 
probability for all SSUs; 2) within-PSU simple random 
sampling with the same sampling rate across sampled PSUs, 
so that the resulting inclusion probabilities for the SSUs in 
PSU $h$ are proportional to $\pi_{1h}$. 

For each sample drawn under both sampling plans, the 
following methods are applied: 

A. The HT estimator. 

B. The model-assisted estimation method. We use a linear 
model regressing the outcome y,, on the first stage 
inclusion probabilities, which are treated as element- 
level information. The GR estimator is computed by the 
formula given in section 1. 

C. The PMM method, with the first-stage inclusion 
probabilities $\pi_{1h}$ as the covariate. We use 20 equally spaced 
percentiles of $\pi_{1h}$ of the sampled PSUs as the knots for 
p-spline regression. 

D. The PMM method with the PSU means $\mu_h$ estimated 
the same way as in C, but using estimated PSU counts 
from a simple linear regression of $N_h$ on the measures 
of size, which are proportional to $\pi_{1h}$. This part of the 
simulation is conducted to study the method described in 
section 4. 

Estimates of $\bar{Y}$ from methods A-D are calculated for 
each of the 500 samples drawn repeatedly from the artificial 
populations (each artificial population is generated only 
once). For the PMM estimator, we compute the empirical 
Bayes, the jackknife (K = 8) and BRR variance estimators 
for each repeated sample. The mean estimate for the 
variance of PMM and the coverage rate of the 
corresponding 95% confidence interval are used to judge the 
quality of inference. For method D, we study the empirical 
bias of the model-based variance estimator described in 
section 4, together with coverage rates of associated 
confidence intervals. 

In the second simulation study, we draw samples of 
household income data from the 5% public use microdata 
sample (PUMS) for the State of Michigan in the 1990 US 
Census, which we treat as a finite population. This 
simulation is more realistic than the previous simulation in 
that the outcome values are drawn from a real rather than 
simulated distribution. The PSUs we simulate are based on 
the natural geographical clusters called “Public Use 
Microdata Areas” (PUMAs), which are typically counties 
and places. There are 67 PUMAs in the Michigan 5% 
PUMS, with counts of families ranging from around 1,300 
to over 10,000. We increase the number of available PSUs 
by dividing each PUMA into 5, resulting in 335 PSUs. The 
PSU counts range from 134 to 3,058. Figure 1 gives the 
scatter plot of one sample of the average household income 
versus sampled PSU sizes together with the regression curve 
$\hat{f}(x)$. 

Figure 1. P-spline Regression Curve (dotted line) and the Average Household Income (stars) in Sampled PSUs 

Five hundred two-stage samples are drawn, each 
consisting of 30 PSUs and 20 SSUs (families) from each 
selected PSU. The first stage sampling is systematic PPS 
where the measures of size are equal to the PSU counts. The 
second stage sample is simple random sampling with 
inclusion probabilities proportional to the inverse of the first 
stage inclusion probabilities. In the estimation of the mean, 
we use the true PSU counts as variable $x_h$, with values 
proportional to the first-stage inclusion probabilities. We 
apply the p-spline nonparametric mixed model formulated 
in (2). We use 10 equally spaced sample percentiles of the 
PSU counts as the knots in the p-spline. 


5.2 Results 


Table 1 gives the empirical bias and root mean squared 
error (RMSE) from four estimation methods of the finite 
population mean applied to equal probability samples from 
populations generated with both normal and log-normal 
within-PSU errors and two $(\sigma, \tau)$ combinations. The 
empirical bias and RMSE are estimated by the mean error 
and the square root of the mean squared error over the 500 repeated samples. 
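
The summary measures used in the tables can be computed from the repeated-sample estimates as in the following sketch. The function names are hypothetical, and the coverage function assumes the usual normal-theory interval (estimate plus or minus 1.96 standard errors).

```python
import numpy as np

def empirical_bias_rmse(estimates, truth):
    """Mean error and root mean squared error over repeated samples."""
    errors = np.asarray(estimates, dtype=float) - truth
    return float(errors.mean()), float(np.sqrt(np.mean(errors ** 2)))

def coverage_rate(estimates, variances, truth, z=1.96):
    """Share of repeated samples whose normal-theory interval covers the truth."""
    estimates = np.asarray(estimates, dtype=float)
    half_width = z * np.sqrt(np.asarray(variances, dtype=float))
    return float(np.mean(np.abs(estimates - truth) <= half_width))

# Toy inputs (hypothetical): two replicate estimates of a true mean of 2.
bias, rmse = empirical_bias_rmse([1.0, 3.0], truth=2.0)
cover = coverage_rate([1.0, 3.0], [1.0, 0.01], truth=2.0)
```

In the toy inputs the two errors cancel, so the bias is zero while the RMSE is 1, and only the first interval is wide enough to cover the true value.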

Table 1 suggests the PMM-based methods give 
estimators with small biases. In the case of equal probability 
sampling, the PMM estimator is roughly as efficient as the HT 
estimator when the mean function $f$ is constant (NULL). In the 
more general case of LINDOWN, where 
$f$ is linear but not constant, the linear model-assisted and 
PMM methods are comparable and both are more efficient 
than the HT estimator in terms of root mean squared error. 
For populations EXP and SINE, whose mean functions are 
not linear, the PMM method is superior to both the HT and 
the linear model-assisted estimators. The improvement in 
efficiency requires knowledge of the complete design 
information including probabilities $\pi_{1h}$ and PSU counts 
$N_h$ for the whole population. When using estimated PSU 
counts $\hat{N}_h$ in place of $N_h$, the resulting estimator is less 
efficient than in the case with known $N_h$, but the PMM 
estimator can still outperform the HT when the mean 
function is non-constant. Comparisons on populations with 
normal or log-normal within-PSU errors result in similar 
findings. 

Similar gains for the PMM method are seen in Table 2, 
for the case of unequal probability sampling. This suggests 
that the key to improved efficiency is the better prediction 
given by the nonparametric models. Tables 1 and 2 both 
suggest that the p-spline model-based estimators have very 
small empirical design-biases. We believe this is because 
the flexible mean functions yield good predictions of the 
PSU means. 

Table 3 compares point estimation and coverage of 95% 
confidence intervals from three variance estimation methods 
for PMM: the empirical Bayes model-based method, the 
jackknife method and the BRR method. The empirical 
Bayes method is generally satisfactory but tends to 
underestimate the true variance of the PMM estimator, resulting 
in under-coverage in some cases. The jackknife and the 
BRR methods tend to yield more robust estimates for the 
variance. In general, PMM yields estimates with improved 
efficiency over the traditional HT and linear model-assisted 
estimators and satisfactory design-based inferences. 


Table 1 
Empirical Biases and RMSE of PMM, HT, GR and PMM with Estimated $N_h$ for Samples Under Equal Probability Designs 
(entries $\times 10^{-3}$; columns in order: PMM BIAS, PMM RMSE; Horvitz-Thompson BIAS, RMSE; Linear Model-Assisted BIAS, RMSE; PMM with Estimated $N_h$ BIAS, RMSE) 

Normal errors, $\tau = 0.2$, $\sigma = 0.1$ 
NULL ile 29.7 0.8 30.0 0.8 29.9 123 301 
LINDOWN Sy) BOM, 3.6 36.4 Bui SOR, ae 30.4 
EXP -4.4 29.1 -9.4 5o0) = OS 36.7 Ane 29.1 
SINE 4.8 SVL) Qe 42.0 (3) 35.9 52) 34.3 

Normal errors, $\tau = 0.1$, $\sigma = 0.2$ 
NULL Sail 22.0 6.6 OD 6.6 Diep 5» DOS 
LINDOWN 0.5 20.4 -0.6 Dial -03 20.5 1.6 20.6 
EXP 0.9 2B aAll 1.9 50.3 — AnD Sle 0.4 23.4 
SINE ING; 2273 6.5 34.9 3.8 26.4 8.0 26.4 

Log-normal errors, $\tau = 0.2$, $\sigma = 0.1$ 
NULL 127 325 0.9 32.3 0.7 32.3 165 BES 
LINDOWN 2.9 31.9 3.8 39.4 Ol S25 32 32.0 
EXP -0.6 28.4 -59 Ses) -6.9 36.4 -().3 DEES 
SINE 6.9 33.8 1 AS, =o 39.0 -3.1 Bon) 

Log-normal errors, $\tau = 0.1$, $\sigma = 0.2$ 
NULL 8.5 30.5 9.6 Siles 9.2 Sy) 9.1 30.8 
LINDOWN 3.6 Byn5 1.9 37) 3.6 BOA 6.4 33.1 
EXP 3.9 29.0 6.8 53.8 1.0 34.4 Sh] 29.4 
SINE =A°) 30.1 = 240) 44.7 120 38.4 =Bh8 35.9 


Table 2 
Empirical Biases and RMSE of PMM, HT, GR and PMM with Estimated $N_h$ for Samples Under Unequal Probability Designs 
(entries $\times 10^{-3}$; columns in order: PMM BIAS, PMM RMSE; Horvitz-Thompson BIAS, RMSE; Linear Model-Assisted BIAS, RMSE; PMM with Estimated $N_h$ BIAS, RMSE) 

Normal errors, $\tau = 0.2$, $\sigma = 0.1$ 
NULL -4.5 29.3 -3.7 33.6 -3.2 30.5 -4.5 29.3 
LINDOWN -0.9 270 hel 355 1.8 PAT fea -0.7 26.9 
EXP 5.8 32.0 1.9 56.8 0.4 39.4 14.1 34.4 
SINE The 30.1 6.1 39.5 3.6 32.8 58. 30.4 

Normal errors, $\tau = 0.1$, $\sigma = 0.2$ 
NULL iia eu Ones -7.7 24.9 -6.6 DAM -7.6 Dil 
LINDOWN (ll DOH By 30.6 12 20.7 aS S| 
EXP -2.3 20.9 -6.5 53.5 -7.2 30.0 -3.0 20.9 
SINE 5.6 20.9 6.9 36.2 4.0 28.6 43 Pa a 

Log-normal errors, $\tau = 0.2$, $\sigma = 0.1$ 
NULL -0.5 28.5 -2.0 30.6 -2.1 29.5 -0.3 28.5 
LINDOWN 5.4 32.6 5) (0) 39.0 Sx), 34.1 6.0 327 
EXP -13 28.6 -7.6 62.6 -7.1 36.8 -9,3 30.3 
SINE Bei Sie Be el 0.1 36.1 1.6 31.0 

Log-normal errors, $\tau = 0.1$, $\sigma = 0.2$ 
NULL 3.6 22.8 Su7 28.8 ST) 24.2 3.6 DOT 
LINDOWN 6.0 26.8 9.3 SIS ES DS 2S 26.0 
EXP 0.8 26.3 -2.3 50.8 -3.5 3351 tes 29.0 
SINE Bo 26.9 2.9 37.6 -0.1 3032 22, 27.8 

Table 3 
Variance Estimation and Empirical Coverage Rates of 95% C.I. Using the Model-based, Jackknife and BRR Methods 
(variances $\times 10^{-5}$; coverage in %) 

                     Empirical   Empirical Bayes       Jackknife (K = 8)     BRR 
                     variance    Model-based 
Shape                            Estimate  Coverage    Estimate  Coverage    Estimate  Coverage 

Normal errors, $\tau = 0.2$, $\sigma = 0.1$ 
NULL                 88          74        92.8        94        96.4        96        94.4 
LINDOWN              94          We        89.6        94        94.6        98        94.2 
EXP                  85          70        91.4        88        94.6        85        93.4 
SINE                 83          67        91.6        90        95.8        85        94.4 

Normal errors, $\tau = 0.1$, $\sigma = 0.2$ 
NULL                 48          45        93.8        48        96.0        49        93.8 
LINDOWN              42          45        96.8        51        96.2        51        96.8 
EXP                  53          54        95.0        61        97.2        59        95.2 
SINE                 44          46        95.8        55        96.6        49        96.0 

Log-normal errors, $\tau = 0.2$, $\sigma = 0.1$ 
NULL                 104         83        91.8        104       94.8        100       93.6 
LINDOWN              102         98        93.6        106       95.6        107       95.0 
EXP                  81          Wa        93.4        97        96.4        89        94.8 
SINE                 92          99        94.8        97        95.2        92        93.4 

Log-normal errors, $\tau = 0.1$, $\sigma = 0.2$ 
NULL                 93          97        94.2        100       96.2        99        95.2 
LINDOWN              104         101       93.6        106       96.0        102       92.8 
EXP                  84          81        94.6        84        95.2        82        95.0 
SINE                 110         96        94.4        98        95.6        92        93.0 


Tables 4 and 5 give the empirical variance of the PMM 
estimator when the non-sampled PSU counts $N_h$ are 
estimated. They also give the mean estimated variance of 
this estimator and corresponding coverage rates by the 95% 
C.I. The confidence intervals are calculated by the usual 
normal theory intervals based on our point and variance 
estimators. These two tables show the inference method 
discussed in section 4 tends to underestimate the true 
variance of the PMM estimator using $\hat{N}_h$, occasionally giving 
under-coverage of the population mean. It remains to be 
studied in the future whether the jackknife and BRR methods 
also yield satisfactory inferences for this method. 


For the simulation study using 5% PUMS data, the 
simple mean has bias = -50.9 and RMSE = 2,600 and the 
p-spline nonparametric mixed model based method has 
bias = -41.9 and RMSE = 2,153. Thus both methods have 
small bias and the model-based estimator has an RMSE 17% 
less than the RMSE of the simple mean. This improved 
efficiency is due to the fact that the average household 
income decreases as the number of families in the PSUs 
increases (figure 1). The PMM method exploits this 
relationship in its predictions. 


Table 4 
Variance Estimation and Empirical Coverage Rates of 95% C.I. Using P-spline and Estimated PSU 
Counts, Population Simulated with Normal Errors 
(variances $\times 10^{-5}$; coverage in %) 

            $\sigma = 0.1$ and $\tau = 0.2$             $\sigma = 0.2$ and $\tau = 0.1$ 
            Empirical  Estimated  Coverage    Empirical  Estimated  Coverage 
            Variance   Variance   Rate        Variance   Variance   Rate 
NULL        90         76         91.8        50         46         93.2 
LINDOWN     93         74         90.4        43         46         95.6 
EXP         85         UP         93.0        5)         56         96.2 
SINE        110        98         94.8        50         55         97.6 


Table 5 
Variance Estimation and Empirical Coverage Rates of 95% C.I. Using P-spline and Estimated PSU 
Counts, Population Simulated with Log-normal Errors 
(variances $\times 10^{-5}$; coverage in %) 

            $\sigma = 0.1$ and $\tau = 0.2$             $\sigma = 0.2$ and $\tau = 0.1$ 
            Empirical  Estimated  Coverage    Empirical  Estimated  Coverage 
            Variance   Variance   Rate        Variance   Variance   Rate 
NULL        105        84         91.8        95         99         94.8 
LINDOWN     103        98         94.4        110        102        94.4 
EXP         81         719        94.6        87         83         94.2 
SINE        110        150        96.4        91         130        95.8 


6. DISCUSSION 


Previous parametric model-based inference methods 
have been criticized mainly for their potentially large design 
biases when the model is misspecified. In our nonparametric 
models, the linearity assumption is replaced by a much 
weaker assumption of a smoothly-varying relationship. As a 
result, the model-based estimators are more robust, having 
small biases for a variety of population shapes. 

Design information such as inclusion probabilities plays 
a key role in the model-based inference. Inverse-probability 
weighted methods imply simple assumptions about the 
relationship between the outcome variables and the design 
variables. With the method we propose, the gain in 
efficiency is realized by applying nonparametric models that 
relax these assumptions. 

An interesting finding of our study is that the model-based 
estimators can be more efficient than the simple mean for an 
equal probability design. In other studies, we also find gains 
in efficiency from the p-spline nonparametric mixed model in 
estimating post-stratum means in post-stratified samples. 

The empirical Bayes method, the jackknife and BRR 
methods all give good confidence coverage with confidence 
intervals that are narrower than those given by the 
traditional methods. However, we expect the empirical 
Bayes method to be sensitive to model assumptions on the 
variance components (e.g., constant within-PSU variances). 
When the PSU counts are known for the sample but not 
for the whole population, model-based estimates of the 


unknown counts can still provide sound estimates of the 
population mean, if the model tracks the true PSU counts 
precisely enough. The model relating these counts to the 
auxiliary variable was treated parametrically here, but this 
could also be specified nonparametrically without much 
difficulty. 

We believe p-spline nonparametric mixed models can be 
applied to more complex designs such as stratified and 
multi-stage designs. We also believe without much more 
effort our methods can be generalized for binary or ordinal 
outcomes. 

A Finite Population Estimation Study with Bayesian Neural Networks 


FAMING LIANG and ANTHONY YUNG CHEUNG KUK ' 


ABSTRACT 


In this article, we study the use of Bayesian neural networks in finite population estimation. We propose estimators for the finite 
population mean and the associated mean squared error. We also propose to use the Student t-distribution to model the 
disturbances in order to accommodate extreme observations that are often present in the data from social sample surveys. 
Numerical results show that Bayesian neural networks have made a significant improvement in finite population estimation 
over linear regression based methods. 


KEY WORDS: Bayesian model averaging; Bayesian neural network; Evolutionary Monte Carlo; Finite population; 
Markov Chain Monte Carlo; Prediction. 


1. INTRODUCTION 


Regression estimation is widely used in sample surveys 
for incorporating auxiliary population information (Cochran 
1977) with the underlying model 


$$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_p x_{tp} + \varepsilon_t, \qquad (1)$$


where $y_t$ is the survey variable for the $t$-th element of a 
population, $x_t = (x_{t1}, \ldots, x_{tp})'$ is the vector of auxiliary 
variables associated with $y_t$, $\beta_0, \beta_1, \ldots, \beta_p$ are the 
regression coefficients, and $\varepsilon_t$ is the independent 
disturbance with zero mean and common variance. Although this 
model generally performs well, it has several inherent 
limitations. First, the model is specified linearly and thus 
cannot capture some types of nonlinear relationship, which 
may be essential in some applications. Second, the least 
squares estimate, which is widely used for model (1), 
may not be reliable in the presence of collinearity among the 
auxiliary variables. In this case, techniques such as 
condition number reduction (Bankier 1990), ridge 
regression (Bardsley and Chambers 1984), and various 
variable selection procedures (Silva and Skinner 1997) 
have to be used to improve the poor prediction performance 
of the model. Third, the least squares estimate may be 
severely affected by outliers. 
There are attempts to lessen the dependence of estimators 
on the linear model (1). Firth and Bennett (1998) identify a 
sufficient “internal bias calibration” condition under which a 
model-based estimator is automatically design consistent, 
regardless of how well the underlying model fits the popu- 
lation. The condition is met by certain estimators based on 
linear models, certain canonical link generalized linear 
models and nonparametric regression estimators constructed 
from them by a particular style of local likelihood fitting. 




Bias can also be calibrated externally, if not internally. 
Chambers, Dorfman and Wehrly (1993) start with a 
predictor of the population mean based on a heteroscedastic 
linear model and adjust for its bias using nonparametric 
regression. Kuk and Welsh (2001) propose a robustified 
model-based approach whereby a working model is first 
fitted using robust methods and subsequently the condi- 
tional distributions of the residuals given x are estimated 
nonparametrically to account for local model departure or 
outliers in localized regions. 

Another way of incorporating auxiliary information into 
an estimator in a design-consistent manner 
is the model-calibrated approach first proposed by Deville 
and Särndal (1992). The basic idea is to choose weights that 
satisfy certain calibration equations and are closest to the 
normal Horvitz-Thompson design weights according to 
some distance measure. Theberge (1999) applies the 
calibration technique to estimate population parameters other 
than the means. More recently, Wu and Sitter (2001) 
extend the calibration approach to deal with nonlinear as 
well as generalized linear models by using the fitted values 
under these working models to set up the calibration 
equations. The model-calibration approach can be classified 
as “model-assisted” because, while the efficiency of the 
model-calibrated estimator depends on the validity of the 
model, its consistency does not. 

There is certainly a growing trend in the survey literature 
toward using nonlinear and nonparametric regression. Instead of 
model (1), one considers 

$$y_t = g(x_t) + \varepsilon_t,$$

where the regression function $g(\cdot)$ can be any arbitrary 
smooth function. Dorfman (1992) estimates $g$ using the 
Nadaraya-Watson kernel estimator $\hat{g}$ to result in the 
following model-based estimator or predictor of the finite 
population mean, 

$$\hat{\bar{y}} = \frac{1}{N}\left\{\sum_{t=1}^{n} y_t + \sum_{t=n+1}^{N} \hat{g}(x_t)\right\},$$

where it is assumed without loss of generality that the 
sample consists of the first $n$ elements of the population. 
Kuk (1993) uses a kernel method to estimate the 
conditional distribution of $y$ given $x$ as a way of 
incorporating auxiliary information in the estimation of the finite 
population distribution of $y$. For the case of scalar $x$, Breidt 
and Opsomer (2000) estimate $g$ using local polynomial 
regression, with design weights incorporated to account for 
the sampling design used, and propose a generalized 
difference estimator, 

$$\bar{y}_{lp} = \frac{1}{N}\left\{\sum_{t=1}^{n} \frac{y_t - \hat{g}(x_t)}{\pi_t} + \sum_{t=1}^{N} \hat{g}(x_t)\right\} = \sum_{t=1}^{n} w_t y_t,$$

where $\pi_t$ is the sample inclusion probability. It can be 
shown that the weights $w_t$ are calibrated to match the totals 
of $x$ up to the $q$-th order, where $q$ is the order of the local 
polynomial. As a consequence, $\bar{y}_{lp}$ is exactly 
model-unbiased if the true regression function is a polynomial of 
degree $q$ or less. Breidt and Opsomer (2000) also show that 
$\bar{y}_{lp}$ is asymptotically design-unbiased and consistent under 
mild conditions. For more discussion of nonlinear and 
nonparametric methods, see Valliant, Dorfman and Royall 
(2000) (chapter 11). 
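
As a rough illustration of the kernel-based predictor attributed to Dorfman (1992) above, the following Python sketch fits a Nadaraya-Watson smoother to the sampled units and averages observed and predicted values. The Gaussian kernel, the bandwidth value, the function names and the simulated population are all assumptions made for illustration; none of them is taken from the paper.

```python
import numpy as np

def nw_fit(x_train, y_train, x_new, bandwidth):
    """Nadaraya-Watson estimate of g at x_new (Gaussian kernel, an assumption)."""
    # Scaled differences between every new point and every training point.
    u = (x_new[:, None] - x_train[None, :]) / bandwidth
    w = np.exp(-0.5 * u ** 2)
    return (w @ y_train) / w.sum(axis=1)

def model_based_mean(y_sample, x_sample, x_nonsample, bandwidth):
    """Observed y for sampled units plus predicted y for unsampled units, over N."""
    N = len(x_sample) + len(x_nonsample)
    g_hat = nw_fit(x_sample, y_sample, x_nonsample, bandwidth)
    return (y_sample.sum() + g_hat.sum()) / N

# Hypothetical population with a nonlinear regression function.
rng = np.random.default_rng(0)
x_pop = rng.uniform(0.0, 1.0, size=500)
y_pop = np.sin(2 * np.pi * x_pop) + rng.normal(0.0, 0.1, size=500)
n = 100  # the sample is taken to be the first n elements
estimate = model_based_mean(y_pop[:n], x_pop[:n], x_pop[n:], bandwidth=0.05)
print(estimate, y_pop.mean())
```

The bandwidth plays the role that the network structure plays for the BNN: it controls how flexibly the fitted curve follows the sample.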

In this paper, another nonlinear regression method, 
Bayesian neural network (BNN), is applied to the problem. 
BNN has an important advantage of being able to handle 
multivariate auxiliary variables and model selection with 
ease, which is not the case for many other nonlinear and 
nonparametric techniques. BNNs were first introduced by 
Buntine and Weigend (1991) and MacKay (1992), and were 
further developed by Neal (1996), Müller and Insua (1998), 
Marrs (1998), Holmes and Mallick (1998), and Liang and 
Wong (2001). But the BNN proposed in this paper is 
different from those cited above in one important respect: A 
prior is put on each network connection, instead of only on 
the number of hidden units as done in the literature. This 
allows us to treat the selection of network structure and the 
selection of input variables (auxiliary variables) uniformly. 
The network is trained by sampling from the joint posterior 
of the network structure and connection weights. The 
sampled network often has a sparse structure, which 
effectively prevents the data from being overfitted. A 
heavy-tailed distribution, such as the Student t-distribution, is 
proposed to model the disturbances of data with outliers. 
Numerical results show that BNN models have offered a 
significant improvement over the linear regression based 
models in finite population estimation. 


The remaining part of this article is organized as follows. 
In section 2, we describe the BNN models and the 
associated estimators for finite populations. In section 3, we 
present our numerical results for one finite population 
example with two choices of auxiliary variables and 
comparisons with various linear regression based models. In 
section 4, we present our numerical results for another finite 
population example and demonstrate how a cross-validation 
procedure can be applied to determine the parameter setting 
for BNN models. In section 5, we conclude the paper with a 
brief discussion. 


2. FINITE POPULATION ESTIMATION WITH 
BAYESIAN NEURAL NETWORKS 


2.1 Bayesian Neural Network Models 


Suppose we have data pairs $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, 
which were generated from the relationship 

$$y_t = g(x_t) + \varepsilon_t, \qquad (2)$$

where $y_t \in \mathbb{R}$, $x_t = (x_{t1}, \ldots, x_{tp})' \in \mathbb{R}^p$, $g(\cdot)$ is the true 
regression function of unknown form, and $\varepsilon_t/\sigma \sim t(\nu)$ with 
$\nu > 2$ being a known degree of freedom of the $t$-distribution. 
Here $g(\cdot)$ may be highly nonlinear, and $\sigma$ is an unknown 
scale parameter. We use the Student $t$-distribution, instead of 
the usual Gaussian distribution, to model the disturbances 
in order to accommodate extreme observations that are often 
present in the data from social sample surveys. 

Before describing our BNN model, we first give a brief 
description of feed-forward neural networks. Figure 1 
illustrates a one-hidden layer feed-forward neural network. 
It consists of four types of units: bias units, input units, 
hidden units, and output units. The unit to which the input 
features are presented is referred to as the input unit. The 
bias unit is a special type of input unit with a constant 
input, say, 1. The unit where the network output is formed is 
referred to as the output unit. The hidden unit is so called 
because its input and output are only used for internal 
connections and are unavailable to the outside world. In a 
feed-forward neural network, each hidden unit 
independently processes the values fed to it by the units in the 
preceding layer and then presents its output to the units in 
the next layer for further processing. It has been shown by 
several authors (Cybenko 1989; Funahashi 1989; Hornik, 
Stinchcombe and White 1989) that neural networks are 
universal approximators in that a one-hidden layer 
feed-forward neural network with linear output units can 
approximate any continuous function arbitrarily well on compact 
sets by increasing the number of hidden units. For survey 
regression, this is an important advantage of neural network 
models over other regression models. In the survey 
regression literature, whether model-assisted or model-based, 
there is usually considerable attention paid to the 
consequences of model misspecification. The neural network 
model partially avoids this concern owing to its 
property of universal approximation. In section 2.2.1, we 
show that when the sample size is large, the unknown 
regression function $g(\cdot)$ in (2) can be well approximated by 
BNN models, regardless of the true functional form of $g(\cdot)$. 
Essentially, BNN falls into the class of data-driven methods. 


[Network diagram; layers labelled, from top to bottom: Output Unit, Hidden Units, Bias Unit, Input Units.] 


Figure 1. A fully connected one hidden layer feed-forward neural 
network with 4 input units, 3 hidden units and 1 output unit. 
The arrows indicate the direction of data feeding. 


In our BNN model, the function $g(\cdot)$ in model (2) is 
approximated by a function of the form 

$$g(x_t, \alpha, \beta, \gamma) = \alpha_0 I_{\alpha_0} + \sum_{i=1}^{p} \alpha_i x_{ti} I_{\alpha_i} + \sum_{j=1}^{M} \beta_j I_{\beta_j}\, \psi\!\left(\gamma_{j0} I_{\gamma_{j0}} + \sum_{i=1}^{p} \gamma_{ji} x_{ti} I_{\gamma_{ji}}\right), \qquad (3)$$

where $I_\zeta$ is an indicator function which indicates the 
effectiveness of the connection $\zeta$; $M$ denotes the maximum 
number of hidden units, which is specified by users; $\alpha_0$ 
denotes the bias term of the output unit; $\alpha_1, \ldots, \alpha_p$ denote 
the weights on the connections from the input units to the 
output unit; $\beta_1, \ldots, \beta_M$ denote the weights on the 
connections from hidden units to the output unit; $\gamma_{j0}$ 
denotes the bias term of the $j$-th hidden unit; $\gamma_{j1}, \ldots, \gamma_{jp}$ 
denote the weights on the connections from the input units 
to the $j$-th hidden unit; and $\psi(\cdot)$ denotes the activation 
function. Sigmoid and hyperbolic tangent functions are two 
popular choices for the activation function. We set 
$\psi(z) = \tanh(z)$ for all examples of this paper. 
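
The network function (3) can be sketched in a few lines of Python. This is only an illustration of the formula as described in the text: the array layout, the function name `bnn_forward` and the example weights are assumptions, not the authors' software.

```python
import numpy as np

def bnn_forward(x, alpha, beta, gamma, I_alpha, I_beta, I_gamma):
    """Evaluate the network function (3) at one input vector x of length p.

    alpha:   (p+1,)   output bias (index 0) and input-to-output weights
    beta:    (M,)     hidden-to-output weights
    gamma:   (M, p+1) hidden biases (column 0) and input-to-hidden weights
    I_*:     0/1 arrays of matching shapes marking effective connections
    """
    x1 = np.concatenate(([1.0], x))            # prepend the constant bias input
    linear_part = np.sum(I_alpha * alpha * x1)  # direct input-to-output links
    hidden_in = (I_gamma * gamma) @ x1          # net input of each hidden unit
    hidden_out = np.tanh(hidden_in)             # psi = tanh, as in the paper
    return linear_part + np.sum(I_beta * beta * hidden_out)

# Hypothetical example with p = 2 inputs and M = 2 hidden units.
x = np.array([1.0, 2.0])
alpha = np.array([0.5, 1.0, -1.0])
beta = np.array([2.0, 3.0])
gamma = np.array([[0.0, 1.0, 0.0], [1.0, 1.0, 1.0]])
I_alpha = np.ones(3)
I_beta = np.array([1.0, 0.0])    # second hidden unit is switched off
I_gamma = np.ones((2, 3))
value = bnn_forward(x, alpha, beta, gamma, I_alpha, I_beta, I_gamma)
print(value)
```

Setting an indicator to zero removes the corresponding connection without changing the dimension of the stored weight arrays, which is one simple way to represent networks of varying structure.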

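Let $\Lambda$ be the vector consisting of all indicators of model 
(3). Note that $\Lambda$ specifies the structure of the corresponding 
network. Let $\alpha = (\alpha_0, \alpha_1, \ldots, \alpha_p)$, $\beta = (\beta_1, \ldots, \beta_M)$, 
$\gamma_j = (\gamma_{j0}, \ldots, \gamma_{jp})$, $\gamma = (\gamma_1, \ldots, \gamma_M)$, and $\theta = (\alpha_\Lambda, \beta_\Lambda, \gamma_\Lambda, \sigma^2)$, 
where $\alpha_\Lambda$, $\beta_\Lambda$ and $\gamma_\Lambda$ denote the non-zero subsets of $\alpha$, $\beta$ 
and $\gamma$, respectively. Thus, the model (3) is completely 
specified by the tuple $(\theta, \Lambda)$. For simplicity, in the 
following we will use $\theta_\Lambda$ to denote a BNN model and use 
$g(x_t, \theta_\Lambda)$ to re-denote the function $g(x_t, \alpha, \beta, \gamma)$. Also, 
we let $\theta_\Lambda = (\theta, \Lambda)$, and use $\theta_\Lambda$ and $(\theta, \Lambda)$ exchangeably. 
To conduct a Bayesian analysis for model (3), we have the 
following prior distributions: $\alpha_i \sim N(0, \sigma_\alpha^2)$ for 
$\alpha_i \in \alpha_\Lambda$; $\beta_j \sim N(0, \sigma_\beta^2)$ for $\beta_j \in \beta_\Lambda$; $\gamma_{ji} \sim N(0, \sigma_\gamma^2)$ for 
$\gamma_{ji} \in \gamma_\Lambda$; and $f(\sigma^2) \propto 1/\sigma^2$. The total number of effective 
connections of $\Lambda$ is 

$$m = \sum_{i=0}^{p} I_{\alpha_i} + \sum_{j=1}^{M} I_{\beta_j}\, \delta\!\left(\sum_{i=0}^{p} I_{\gamma_{ji}}\right) + \sum_{j=1}^{M} \sum_{i=0}^{p} I_{\beta_j} I_{\gamma_{ji}},$$

where $\delta(z) = 1$ if $z > 0$ and 0 otherwise. 
The model $\Lambda$ is subject to a prior probability that is 
proportional to the mass put on $m$ by a truncated Poisson 
with rate $\lambda$, 

$$P(\Lambda) = \begin{cases} \dfrac{1}{Z}\, \dfrac{\lambda^m}{m!}, & m = 3, 4, \ldots, U, \\ 0, & \text{otherwise}, \end{cases}$$

where $U = (M+1)(p+1) + M$ is the number of connections 
of the full model in which all $I_\zeta = 1$; and 
$Z = \sum_{\Lambda \in \Omega} \lambda^m / m!$. Here we let $\Omega$ denote the set of all 
possible models with $3 \le m \le U$. We set the minimum 
number of $m$ to three based on our view that neural networks 
are usually used for complex problems, and three is a 
small enough number as a limiting network size. In these 
prior distributions, $\sigma_\alpha^2$, $\sigma_\beta^2$, $\sigma_\gamma^2$ and $\lambda$ are hyper-parameters to 
be specified by users (discussed below). Furthermore, we 
assume that these prior distributions are independent 
a priori. Thus, we have the following log-posterior (up to an 
additive constant), 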


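$$\begin{aligned}
\log \pi(\theta_\Lambda \mid D) = {} & \text{Constant} - \left(\frac{n}{2} + 1\right) \log \sigma^2 - \frac{\nu + 1}{2} \sum_{t=1}^{n} \log\left(1 + \frac{(y_t - g(x_t, \theta_\Lambda))^2}{\nu \sigma^2}\right) \\
& - \frac{1}{2} \sum_{i=0}^{p} I_{\alpha_i} \left(\log \sigma_\alpha^2 + \frac{\alpha_i^2}{\sigma_\alpha^2}\right) - \frac{1}{2} \sum_{j=1}^{M} I_{\beta_j}\, \delta\!\left(\sum_{i=0}^{p} I_{\gamma_{ji}}\right) \left(\log \sigma_\beta^2 + \frac{\beta_j^2}{\sigma_\beta^2}\right) \\
& - \frac{1}{2} \sum_{j=1}^{M} \sum_{i=0}^{p} I_{\beta_j} I_{\gamma_{ji}} \left(\log \sigma_\gamma^2 + \frac{\gamma_{ji}^2}{\sigma_\gamma^2}\right) - \frac{m}{2} \log(2\pi) \\
& + m \log \lambda - \log(m!). \qquad (4)
\end{aligned}$$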
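
To make the pieces of the log-posterior (4) concrete, here is a minimal Python sketch that counts the effective connections $m$ and evaluates the Student-t likelihood term, the normal prior terms and the truncated Poisson term. It follows the priors as stated in the text; the function names, argument layout and all numerical values (including the prior variances and rate) are illustrative assumptions rather than the authors' settings or code.

```python
import math
import numpy as np

def effective_connections(I_alpha, I_beta, I_gamma):
    """Number m of effective connections; shapes (p+1,), (M,), (M, p+1)."""
    # delta(.) is 1 when hidden unit j has at least one incoming connection.
    has_input = (I_gamma.sum(axis=1) > 0).astype(int)
    return int(I_alpha.sum() + np.sum(I_beta * has_input)
               + np.sum(I_beta[:, None] * I_gamma))

def log_posterior(resid, sigma2, nu, alpha, beta, gamma,
                  I_alpha, I_beta, I_gamma, s2_a, s2_b, s2_g, lam):
    """Log-posterior up to an additive constant; resid = y_t - g(x_t, theta)."""
    n = len(resid)
    M, p1 = I_gamma.shape
    U = (M + 1) * p1 + M                      # connections of the full model
    m = effective_connections(I_alpha, I_beta, I_gamma)
    if m < 3 or m > U:                        # outside the truncated support
        return -math.inf
    has_input = (I_gamma.sum(axis=1) > 0).astype(int)
    loglik = (-(n / 2 + 1) * math.log(sigma2)
              - (nu + 1) / 2 * np.sum(np.log1p(resid ** 2 / (nu * sigma2))))
    pen_a = -0.5 * np.sum(I_alpha * (math.log(s2_a) + alpha ** 2 / s2_a))
    pen_b = -0.5 * np.sum(I_beta * has_input
                          * (math.log(s2_b) + beta ** 2 / s2_b))
    pen_g = -0.5 * np.sum(I_beta[:, None] * I_gamma
                          * (math.log(s2_g) + gamma ** 2 / s2_g))
    size = (-0.5 * m * math.log(2 * math.pi)
            + m * math.log(lam) - math.lgamma(m + 1))
    return float(loglik + pen_a + pen_b + pen_g + size)

# Hypothetical network with p = 2 inputs and M = 2 hidden units.
I_alpha = np.array([1, 0, 0])
I_beta = np.array([1, 0])
I_gamma = np.array([[1, 1, 0], [1, 1, 1]])
alpha = np.array([0.2, 0.0, 0.0])
beta = np.array([1.0, 0.5])
gamma = np.array([[0.1, -0.3, 0.0], [0.4, 0.2, 0.1]])
resid = np.array([0.1, -0.2, 0.05, 0.3])
m = effective_connections(I_alpha, I_beta, I_gamma)
lp = log_posterior(resid, 1.0, 5.0, alpha, beta, gamma,
                   I_alpha, I_beta, I_gamma, 5.0, 5.0, 5.0, 5.0)
print(m, lp)
```

A sampler over network structures would call such a function for each proposed change of indicators or weights; the reversible jump moves themselves are not sketched here.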


Our BNN model is different from other BNN models 
existing in the literature in two important respects. First, the 
input variables of our BNN model are selected automati- 
cally by sampling from the joint posterior of the network 
structure and weights. Second, the structure of our BNN 
model is usually sparse and its performance depends less on 
the initial specification for the input patterns and the number 
of hidden units. The structure is sparse in the sense that only a small 
number of connections are active in the network. So our 
BNN model avoids the problem of overfitting in a more 
natural way. 

For data preparation and hyperparameter setting, we have 
the following suggestions. To avoid weights being 
trained to be extremely large or small (in absolute value) to 
accommodate different scales of input and output variables, 
we suggest that all input and output variables be normalized 
before feeding to the networks. In all examples of this 
article, the data is normalized by $(y_t - \bar{y})/S_y$, where $\bar{y}$ 
and $S_y$ denote the mean and standard deviation of the 
training data, respectively. Based on the belief that a 
network with a large weight variation usually has a poor 
generalization performance, we suggest that $\sigma_\alpha^2$, $\sigma_\beta^2$ and $\sigma_\gamma^2$ 
be set to moderate values to penalize a large weight 
variation. For example, we set $\sigma_\alpha^2$, $\sigma_\beta^2$ and $\sigma_\gamma^2$ to a single common value for all 
examples of this article. The setting should also be fine for 
other problems. The value of $\lambda$ reflects our belief on the 
network size needed for the data under consideration. Here 
we follow the suggestion of Weigend, Huberman and 
Rumelhart (1990) to choose $\lambda$ such that the number of 
connection weights is about one tenth of the size of the 
training sample. In one simulation, we assessed the 
influence of $\lambda$ on BNN model size and prediction ability. 
The numerical results suggest that the prediction ability of 
BNN models is rather robust to the variation of $\lambda$, although 
the BNN model size increases slowly as $\lambda$ increases. 

To sample from the posterior (4), a Monte Carlo 
algorithm, the so-called reversible jump evolutionary Monte 
Carlo (RJEMC) algorithm, is developed. This algorithm 
extends the evolutionary Monte Carlo algorithm (Liang and 
Wong 2001) to sample from a variable dimensional space 
by incorporating some reversible jump moves proposed in 
Green (1995). For details of the algorithm, please refer to 
the support documents and software for the paper. They are 
available at http://www.stat.tamu.edu/~fliang. 


2.2 Finite Population Estimation with Bayesian 
Neural Networks 


2.2.1 Bayesian Model Averaging 


In this subsection, we review some basic results of 
Bayesian model averaging and show one theorem for BNN 
models, which form the theoretical basis for the estimators 
described in section 2.2.2. Suppose that we are interested in 
estimating the quantity $\rho(\theta_\Lambda)$, which is a function of both 
$\Lambda$ and $\theta$. The Bayesian estimator of $\rho(\theta_\Lambda)$ can be written as 

$$E_\pi \rho(\theta_\Lambda) = \sum_{k=0}^{K} P(\Lambda_k \mid D) \int \rho(\theta_k, \Lambda_k)\, \pi(\theta_k \mid \Lambda_k, D)\, d\theta_k, \qquad (5)$$

where $K$ denotes the total number of models under 
consideration, $\theta_k$ denotes the parameters associated with model 
$\Lambda_k$, and $\pi(\theta_k \mid \Lambda_k, D)$ denotes the posterior density of $\theta_k$ 
conditional on model $\Lambda_k$. Madigan and Raftery (1994) 
argued that Bayesian model averaging 
(averaging over all the models in this fashion) accounts for 
the model uncertainty, and provides better predictive ability, 
as measured by the logarithmic scoring rule, than using any 
single model $\Lambda_k$. See Hoeting, Madigan, Raftery and 
Volinsky (1999) for a tutorial on Bayesian model averaging. 

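Suppose that samples $(\theta_1, \Lambda_1), \ldots, (\theta_M, \Lambda_M)$ have been 
drawn from the posterior distribution $\pi(\theta_\Lambda \mid D)$ by a MCMC 
algorithm; then $\rho(\theta_\Lambda)$ can be estimated by 

$$\hat{\rho}(\theta_\Lambda) = \frac{1}{M} \sum_{i=1}^{M} \rho(\theta_{\Lambda_i}), \qquad (6)$$

where $\theta_{\Lambda_i} = (\theta_i, \Lambda_i)$. Applying the standard Markov chain 
theory (Tierney 1994; Roberts and Casella 1999), under 
regularity conditions we have the following results. If 
$E_\pi |\rho(\theta_\Lambda)| < \infty$, then 

$$\frac{1}{M} \sum_{i=1}^{M} \rho(\theta_{\Lambda_i}) \to E_\pi \rho(\theta_\Lambda), \quad \text{a.s.}, \qquad (7)$$

as $M \to \infty$. Furthermore, if $E_\pi |\rho(\theta_\Lambda)|^{2+\delta} < \infty$ for some 
$\delta > 0$, then 

$$\sqrt{M} \left\{ \frac{1}{M} \sum_{i=1}^{M} \rho(\theta_{\Lambda_i}) - E_\pi \rho(\theta_\Lambda) \right\} \to N(0, \tau^2), \qquad (8)$$

for some positive constant $\tau^2$ as $M \to \infty$, and the 
convergence is in distribution. 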

Similar to (7) and (8), we have the following theorem for 
BNN models, whose proof is presented in the Appendix. 


Theorem 2.1 Let $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ denote a 
simple random sample drawn from a population which can 
be modeled by model (2). Let $(\theta_1, \Lambda_1), \ldots, (\theta_M, \Lambda_M)$ 
denote the sample drawn from the posterior distribution 
$\pi(\theta_\Lambda \mid D)$, given in (4), by a MCMC method. Then, for any 
$x_0$ drawn from the same distribution as the observations 
$D$, we have 

(a) 
$$E_\pi |g(x_0, \theta_\Lambda)|^{2+\delta} < \infty, \qquad (9)$$
for some $\delta > 0$, as $n \to \infty$. 

(b) 
$$\frac{1}{M} \sum_{i=1}^{M} g(x_0, \theta_{\Lambda_i}) \to g(x_0), \quad \text{a.s.}, \qquad (10)$$
as $n \to \infty$ and $M \to \infty$. 

(c) 
$$\sqrt{M} \left[ \frac{1}{M} \sum_{i=1}^{M} g(x_0, \theta_{\Lambda_i}) - g(x_0) \right] \to N(0, \tau_g^2), \qquad (11)$$
for some positive constant $\tau_g^2$ as $n \to \infty$ and 
$M \to \infty$, and the convergence is in distribution. 

To show some properties of moments of $\frac{1}{M} \sum_{i=1}^{M} g(x_0, \theta_{\Lambda_i})$, 
we need the following theorem (Billingsley 
1986, page 348, Corollary). 


Theorem 2.2 Let $r$ be a positive integer. If $X_n \to X$ in 
distribution and $\sup_n E|X_n|^{r+\delta} < \infty$, where $\delta > 0$, then 
$E|X|^r < \infty$ and $E X_n^r \to E X^r$. 


Following from (9), (11) and Theorem 2.2, we know 

$$M\, E \left[ \frac{1}{M} \sum_{i=1}^{M} g(x_0, \theta_{\Lambda_i}) - g(x_0) \right]^2 \to \tau_g^2, \qquad (12)$$

as $n \to \infty$ and $M \to \infty$. It implies that 

$$E \left[ \frac{1}{M} \sum_{i=1}^{M} g(x_0, \theta_{\Lambda_i}) - g(x_0) \right]^2 = \frac{\tau_g^2}{M} + o\!\left(\frac{1}{M}\right) \qquad (13)$$

holds when $n$ and $M$ are both large. 

Note we have shown that (11) and (13) hold as the 
sample size $n \to \infty$. In the context of a finite population, 
especially a small finite population, a more precise 
expression for (11) and (13) would be 

$$\sqrt{M} \left[ \frac{1}{M} \sum_{i=1}^{M} g(x_0, \theta_{\Lambda_i}) - E(y_0 \mid D, x_0) \right] \to N(0, \tau_g^2) \qquad (14)$$

and 

$$E \left[ \frac{1}{M} \sum_{i=1}^{M} g(x_0, \theta_{\Lambda_i}) - E(y_0 \mid D, x_0) \right]^2 = \frac{\tau_g^2}{M} + o\!\left(\frac{1}{M}\right), \qquad (15)$$

where $E(y_0 \mid D, x_0)$ denotes the prediction of $y_0$, which is 
the survey variable corresponding to $x_0$. The equations (14) 
and (15) take into account the possible bias of the sample 
$D$. In the case that the population consists of many exact 
copies of the sample $D$, $E(y_0 \mid D, x_0) = g(x_0)$ holds, and 
equations (14) and (15) are reduced to (11) and (13), 
respectively. 


2.2.2 BMA Estimators in Finite Populations 


Consider a finite population of $N$ distinguishable 
elements. Associated with the $i$-th element are the survey 
variable $y_i$ and the auxiliary variables $x_i$. The values 
$x_1, \ldots, x_N$ are known for the entire population, while $y_i$ 
is known only if the $i$-th unit is selected in the sample. 
Suppose a simple random sample $D = 
\{(x_1, y_1), \ldots, (x_n, y_n)\}$ has been drawn from the finite 
population, a BNN model has been built for the sample, and 
$(\theta_1, \Lambda_1), \ldots, (\theta_M, \Lambda_M)$ have been drawn from the 
posterior distribution of the BNN model. The BMA 
estimator for the mean of the finite population is 

$$\hat{\bar{Y}}_{\mathrm{BNN}} = f \bar{y} + \frac{1}{MN} \sum_{i=1}^{M} \sum_{t=n+1}^{N} g(x_t, \theta_{\Lambda_i}),$$

where $\bar{y}$ is the sample mean of $y_1, \ldots, y_n$, and $f = n/N$ 
is the sampling fraction. About this estimator, we have the 
following comments. First, $\hat{\bar{Y}}_{\mathrm{BNN}}$ is a model-based 
estimator, so that all the inference is with respect to the model 
for the $y_t$'s, not the survey design. As long as the model 
holds, the BNN estimator will have the mean squared error 
properties described below for any ignorable sampling 
design. Second, this estimator is identical to that proposed in 
Dorfman (1992), except that the kernel-based regression is 
replaced by the BNN. Third, this estimator can be used to 
estimate the mean of a finite population as long as each of 
the unsampled elements has the same distribution as the 
sample $D$. 
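
Given posterior draws of the fitted function evaluated at the unsampled units, the BMA estimator above is a single average. The Python sketch below assumes those evaluations are already available as an $M \times (N-n)$ array; the function name and the toy numbers are hypothetical.

```python
import numpy as np

def bma_population_mean(y_sample, g_draws_nonsample, N):
    """BMA estimate of the finite population mean.

    y_sample:           (n,) observed survey values
    g_draws_nonsample:  (M, N-n) array; row i holds g(x_t, theta_i) for the
                        unsampled units t = n+1, ..., N
    """
    n = len(y_sample)
    f = n / N                                   # sampling fraction
    M = g_draws_nonsample.shape[0]
    return f * y_sample.mean() + g_draws_nonsample.sum() / (M * N)

# Toy example: 3 sampled units, 2 unsampled units, 2 posterior draws.
y_sample = np.array([1.0, 2.0, 3.0])
g_draws = np.array([[4.0, 6.0], [6.0, 4.0]])
est = bma_population_mean(y_sample, g_draws, N=5)
print(est)
```

In the toy example the two draws average to a prediction of 5 for each unsampled unit, so the estimate is (1 + 2 + 3 + 5 + 5)/5.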

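The accuracy of an estimate can be measured by its mean 
squared error $E(\hat{\bar{Y}}_{\mathrm{BNN}} - \bar{Y})^2$, where $\bar{Y}$ denotes the true 
population mean. To estimate $E(\hat{\bar{Y}}_{\mathrm{BNN}} - \bar{Y})^2$, we first 
consider 

$$\begin{aligned}
E\{(\hat{\bar{Y}}_{\mathrm{BNN}} - \bar{Y})^2 \mid D, X_{-D}\}
&= E\left\{ \left( \frac{1}{MN} \sum_{i=1}^{M} \sum_{t=n+1}^{N} g(x_t, \theta_{\Lambda_i}) - \frac{1}{N} \sum_{t=n+1}^{N} y_t \right)^2 \,\Big|\, D, X_{-D} \right\} \\
&= E\left\{ \left( \frac{1}{MN} \sum_{i=1}^{M} \sum_{t=n+1}^{N} g(x_t, \theta_{\Lambda_i}) - \frac{1}{N} \sum_{t=n+1}^{N} g(x_t) \right)^2 \,\Big|\, D, X_{-D} \right\} + \frac{N-n}{N^2} \mathrm{var}(\varepsilon_t) \\
&= E\left\{ \left( \frac{1}{MN} \sum_{i=1}^{M} \sum_{t=n+1}^{N} g(x_t, \theta_{\Lambda_i}) - (1-f) E(\bar{y}_{-D} \mid D, X_{-D}) \right.\right. \\
&\qquad\quad \left.\left. {} + (1-f) \{ E(\bar{y}_{-D} \mid D, X_{-D}) - E(\bar{y}_{-D}) \} \right)^2 \,\Big|\, D, X_{-D} \right\} + \frac{N-n}{N^2} \mathrm{var}(\varepsilon_t) \\
&\approx \frac{\tau_Y^2}{M} + (1-f)^2 \{ E(\bar{y}_{-D} \mid D, X_{-D}) - E(\bar{y}_{-D}) \}^2 + \frac{1-f}{N} \mathrm{var}(\varepsilon_t), \qquad (16)
\end{aligned}$$

where $X_{-D} = (x_{n+1}, \ldots, x_N)$ denotes the set of auxiliary 
vectors of the unsampled elements; $\bar{y}_{-D}$ denotes the 
averaged survey value of the unsampled elements, and 

$$E(\bar{y}_{-D}) = \frac{1}{N-n} \sum_{t=n+1}^{N} g(x_t).$$
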
The last approximation of (16) follows from (15), that is, as 
$M$ is large, 

$$E\left\{ \frac{1}{MN} \sum_{i=1}^{M} \sum_{t=n+1}^{N} g(x_t, \theta_{\Lambda_i}) - (1-f) E(\bar{y}_{-D} \mid D, X_{-D}) \right\}^2 \approx \frac{\tau_Y^2}{M},$$

for some positive constant $\tau_Y^2$. The term 
$E(\bar{y}_{-D} \mid D, X_{-D}) - E(\bar{y}_{-D})$ is the prediction bias due to the 
randomness or sampling bias of $D$. Following from (16), we 
have 

$$E(\hat{\bar{Y}}_{\mathrm{BNN}} - \bar{Y})^2 \approx \frac{\tau_Y^2}{M} + (1-f)^2 E\{ E(\bar{y}_{-D} \mid D, X_{-D}) - E(\bar{y}_{-D}) \}^2 + \frac{1-f}{N} \mathrm{var}(\varepsilon_t). \qquad (17)$$


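The quantity $\tau_Y^2$ can be estimated by the batch means 
method (Roberts 1996) as follows. Run the Markov chain 
for $M = rs$ iterations, where $s$ is the batch size and is 
assumed sufficiently large such that 

$$\hat{\bar{Y}}_{\mathrm{BNN},k} = f \bar{y} + \frac{1}{sN} \sum_{i=(k-1)s+1}^{ks} \sum_{t=n+1}^{N} g(x_t, \theta_{\Lambda_i}), \quad k = 1, \ldots, r,$$

is approximately independently $N(f \bar{y} + (1-f) E(\bar{y}_{-D} \mid D, X_{-D}),\ \tau_Y^2 / s)$. 
Therefore $\tau_Y^2$ can be approximated by 

$$\hat{\tau}_Y^2 = \frac{s}{r-1} \sum_{k=1}^{r} (\hat{\bar{Y}}_{\mathrm{BNN},k} - \hat{\bar{Y}}_{\mathrm{BNN}})^2, \qquad (18)$$

which can be substituted into (17) in lieu of $\tau_Y^2$. Under 
the assumption $\varepsilon_t/\sigma \sim t(\nu)$, the BMA estimator of $\mathrm{var}(\varepsilon_t)$ is 

$$\widehat{\mathrm{var}}(\varepsilon_t) = \frac{\nu}{\nu - 2} \cdot \frac{1}{M} \sum_{i=1}^{M} \sigma_i^2, \qquad (19)$$

where $\sigma_i^2$ denotes the value of $\sigma^2$ in the $i$-th posterior sample. 
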

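Under the assumption that the population is made up of 
exact copies of the training data, we have 
$E(\bar{y}_{-D} \mid D, X_{-D}) - E(\bar{y}_{-D}) = \hat{\bar{y}} - \bar{y}$, where $\hat{\bar{y}}$ denotes the fitted 
sample mean, and 

$$E(\hat{\bar{y}} - \bar{y})^2 = E\left[ \frac{1}{n} \sum_{t=1}^{n} \hat{\varepsilon}_t \right]^2 = \frac{1}{n} \mathrm{var}(\hat{\varepsilon}_t), \qquad (20)$$

where $\hat{\varepsilon}_t = \sum_{i=1}^{M} g(x_t, \theta_{\Lambda_i})/M - y_t$ is the residual of the $t$-th 
element of $D$, and the $\hat{\varepsilon}_t$'s are assumed to be iid with 
$E(\hat{\varepsilon}_t) = 0$. Under the true model, we have $\mathrm{var}(\hat{\varepsilon}_t) = 
\mathrm{var}(\varepsilon_t)$. Hence, we suggest that $E\{ E(\bar{y}_{-D} \mid D, X_{-D}) - E(\bar{y}_{-D}) \}^2$ 
be estimated by 

$$\widehat{\mathrm{Bias}}^2 = \frac{\widehat{\mathrm{var}}(\varepsilon_t)}{n}. \qquad (21)$$
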

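In summary, $E(\hat{\bar{Y}}_{\mathrm{BNN}} - \bar{Y})^2$ can be estimated by 

$$\hat{E}(\hat{\bar{Y}}_{\mathrm{BNN}} - \bar{Y})^2 = \frac{\hat{\tau}_Y^2}{M} + \frac{(1-f)^2}{n} \widehat{\mathrm{var}}(\varepsilon_t) + \frac{1-f}{N} \widehat{\mathrm{var}}(\varepsilon_t). \qquad (22)$$

Since $(1-f)^2/n + (1-f)/N = (1-f)/n$, when $M$ is large enough 
for the Monte Carlo term to be negligible this reduces to 

$$\hat{E}(\hat{\bar{Y}}_{\mathrm{BNN}} - \bar{Y})^2 \approx \frac{1-f}{n} \widehat{\mathrm{var}}(\varepsilon_t). \qquad (23)$$

We note that this estimate is identical in form to that given 
by Cochran (1977) for the linear regression estimator. 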
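
The mean squared error estimate combines three ingredients: a batch means estimate of the Monte Carlo variance, the BMA estimate of the disturbance variance, and the bias term. The following Python sketch shows one way these could be put together, assuming per-draw predictions for the unsampled units and per-draw values of the scale parameter are available. The function name, the argument layout and the exact way the terms are combined are assumptions made for illustration, since the corresponding displays in the source are only partly legible.

```python
import numpy as np

def estimate_mse(y_sample, g_draws_nonsample, sigma2_draws, nu, N, n_batches):
    """Sketch of an MSE estimate for the BMA population mean estimator.

    g_draws_nonsample: (M, N-n) predictions for unsampled units, one row per draw
    sigma2_draws:      (M,) posterior draws of the squared scale parameter
    """
    n = len(y_sample)
    f = n / N
    s = g_draws_nonsample.shape[0] // n_batches     # batch size
    M = s * n_batches                                # draws actually used
    # Population mean estimate implied by each single posterior draw.
    per_draw = f * y_sample.mean() + g_draws_nonsample[:M].sum(axis=1) / N
    batch_means = per_draw.reshape(n_batches, s).mean(axis=1)
    # Batch means estimate of the Monte Carlo variance constant.
    tau2_hat = s * batch_means.var(ddof=1)
    # Variance of a scaled Student-t disturbance, averaged over draws.
    var_eps_hat = nu / (nu - 2) * sigma2_draws[:M].mean()
    bias2_hat = var_eps_hat / n
    mse_hat = (tau2_hat / M + (1 - f) ** 2 * bias2_hat
               + (1 - f) / N * var_eps_hat)
    return mse_hat, tau2_hat, var_eps_hat

# Toy check: identical draws give zero Monte Carlo variance.
y_sample = np.array([1.0, 2.0, 3.0, 4.0])
g_draws = np.tile(np.array([2.0, 3.0, 2.0, 3.0, 2.0, 3.0]), (20, 1))
sigma2_draws = np.ones(20)
mse_hat, tau2_hat, var_eps_hat = estimate_mse(
    y_sample, g_draws, sigma2_draws, nu=4.0, N=10, n_batches=5)
print(mse_hat, tau2_hat, var_eps_hat)
```

In the toy check the Monte Carlo term vanishes, so the result equals $(1-f)/n$ times the estimated disturbance variance, which is the Cochran-type form mentioned in the text.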


3. FIRST SIMULATION STUDY 


3.1 The Data 


Our simulation population comprises 426 records for 
heads of household surveyed using the sample (long) 
questionnaire during the 1988 Test Population Census of 
Limeira, in São Paulo state, Brazil. This test was carried out 
as a pilot survey during the preparation for the 1991 
Brazilian Population Census. For a detailed description of 
the test census, see Silva and Skinner (1997). We followed 
Silva and Skinner (1997) to consider the total monthly 
income as the main survey variable (y) together with 11 
potential auxiliary variables, namely, 


$x_1$ indicator of sex of head of household equal to male; 

$x_2$ indicator of age of head of household less than or equal to 35; 

$x_3$ indicator of age of head of household greater than 35 and less 
than or equal to 55; 

$x_4$ total number of rooms in household; 

$x_5$ total number of bathrooms in household; 

$x_6$ indicator of ownership of household; 

$x_7$ indicator that household type is house; 

$x_8$ indicator of ownership of at least one car in household; 

$x_9$ indicator of ownership of color TV in household; 

$x_{10}$ years of study of head of household; 

$x_{11}$ proxy of total monthly income of head of household. 


Figure 2, the scatter plots of $y$ versus the 11 auxiliary 
variables, shows that a linear regression model is not 
appropriate for the data. Although $y$ and $x_{11}$ are strongly 
linearly correlated, the scatter plots of $y$ versus some other 
auxiliary variables, say $x_4$, $x_5$ and $x_{10}$, suggest that their 
relationships cannot be well modeled by a linear regression. 
In addition, if the data is modeled by a linear regression, the 
outlier, the 53rd element, may have a high influence on 
fitting and prediction of the model. More precisely, if the 
element is included in the training data, the fitted response 
curve will drift upward compared with the true curve and as 
a result the finite population mean will be overestimated; if 
the element is not included in the training data, prediction 
will proceed as though there were no outliers and as a result 
the finite population mean will be underestimated. The 
presence of this highly influential element also poses a great 
challenge to BNN models and other data analysis strategies. 

We followed Silva and Skinner (1997) to construct two 
alternative sets of auxiliary variables for simulations. The 
first set contains $x_1, \ldots, x_4$ and $x_{11}$, which includes the 
proxy variable $x_{11}$ and has a reasonable explanatory power 
in predicting $y$. The second set contains $x_1, \ldots, x_{10}$, which 
has a weaker explanatory power than the first one due to the 
exclusion of $x_{11}$. So these two sets illustrate the predictive 
performances of BNN models with strong and weak 
auxiliary variables, respectively. As in Silva and Skinner 
(1997), 1,000 sample replicates of size 100 from this 
simulation population are selected by simple random 
sampling without replacement. The following computations 
were performed on the 1,000 replicates. 

Each replicate, say $k$, was analyzed by BNN 
models and various linear regression based strategies 
(reviewed below). For any strategy, the population mean 
estimate and its estimated mean squared error for 
replicate $k$ are denoted by $\bar{y}(k)$ and $V(\bar{y}(k))$, respectively. 
The computational results were summarized by computing 
the mean (MEAN), bias (BIAS), mean squared error (MSE) 
and average of mean squared error estimates (AVMSE) 
from the set of the 1,000 replicates, given respectively by 

$$\mathrm{MEAN} = \sum_{k=1}^{S} \bar{y}(k)/S;$$

$$\mathrm{BIAS} = \mathrm{MEAN} - \bar{Y};$$

$$\mathrm{MSE} = \sum_{k=1}^{S} [\bar{y}(k) - \bar{Y}]^2 / S;$$

$$\mathrm{AVMSE} = \sum_{k=1}^{S} V(\bar{y}(k))/S,$$

where $S$ is the total number of sample replicates under 
consideration, and $\bar{Y} = 194.34$ for the simulation 
population. Empirical coverage rates for 95% confidence 
intervals based on asymptotic normal theory were also 
computed for each strategy and these rates, expressed as 
percentages, are presented in the last columns of Tables 1 
and 3. 

Figure 2. Scatter plots of the response variable $y$ versus the auxiliary variables. In the plot of $y$ versus $x_{11}$, the “+” 
represents the 53rd element of the population. 
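
The summary measures MEAN, BIAS, MSE and AVMSE, together with the empirical coverage of normal-theory 95% intervals, can be computed from the replicate results as in the Python sketch below. The function name, the dictionary keys and the toy inputs are illustrative assumptions, and the interval is taken to be the estimate plus or minus 1.96 times the square root of the estimated mean squared error.

```python
import numpy as np

def summarize_replicates(estimates, mse_estimates, true_mean, z=1.96):
    """Summaries over S replicates of an estimator and its MSE estimate."""
    estimates = np.asarray(estimates, dtype=float)
    mse_estimates = np.asarray(mse_estimates, dtype=float)
    mean = estimates.mean()
    half_width = z * np.sqrt(mse_estimates)     # normal-theory half width
    covered = np.abs(estimates - true_mean) <= half_width
    return {
        "MEAN": mean,
        "BIAS": mean - true_mean,
        "MSE": np.mean((estimates - true_mean) ** 2),
        "AVMSE": mse_estimates.mean(),
        "COVERAGE": 100.0 * covered.mean(),     # expressed as a percentage
    }

# Toy example with S = 2 replicates and a true mean of 2.
summary = summarize_replicates([1.0, 3.0], [1.0, 0.25], true_mean=2.0)
print(summary)
```

Comparing MSE with AVMSE shows whether the mean squared error estimator is, on average, close to the empirical mean squared error over replicates.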

3.2 Review of the Linear Regression Based 
Strategies 


The linear regression based strategies that have been 
considered by Silva and Skinner (1997) are listed as 
follows. 


SM) Sample mean estimator, with no auxiliary variables (y, V,). 
Fs) Forward selection of auxiliary variables with ($\bar{y}_r$, $V_s$). 
Fd) Forward selection of auxiliary variables with ($\bar{y}_r$, $V_d$). 
Fg) Forward selection of auxiliary variables with ($\bar{y}_r$, $V_g$). 


Bs) Best subset selection from all subsets of auxiliary variables with 
($\bar{y}_r$, $V_s$). 


Bd) Best subset selection from all subsets of auxiliary variables with 
($\bar{y}_r$, $V_d$). 

Bg) Best subset selection from all subsets of auxiliary variables with 
($\bar{y}_r$, $V_g$). 

FI) _ Fixed subset of auxiliary variable with (y,, V,). 

SS) Saturated subset of auxiliary variable with (y,, V,). 

FR) Forward subset selection using SAS PROC REG, with (7,,V;). 


CN) Condition number reduction subset selection procedure with 
($\bar{y}_r$, $V_s$). 


RI) _ Ridge regression estimator proposed by Dunstan and Chambers 
(1986). 


To facilitate the description of the above strategies, we 
define the following notation. Let $U = \{1, \ldots, N\}$ denote a 
finite population of $N$ distinguishable elements, $D \subset U$ 
denote a sample replicate of $n$ elements drawn from $U$ by 
simple random sampling without replacement, 
$x_i = (x_{i1}, \ldots, x_{ip})'$ be the vector of auxiliary variables associated 
with the $i$-th element, and $\beta = (\beta_1, \ldots, \beta_p)'$. Let 
$\bar{X} = N^{-1} \sum_{i \in U} x_i$ be the vector of population means, 
$\bar{x} = n^{-1} \sum_{i \in D} x_i$ be the vector of sample means, 
$\bar{y} = n^{-1} \sum_{i \in D} y_i$ be the sample mean of the response 
variable, $S_x = n^{-1} \sum_{i \in D} (x_i - \bar{x})(x_i - \bar{x})'$, 
$S_{xy} = n^{-1} \sum_{i \in D} (x_i - \bar{x})(y_i - \bar{y})$, 
$g_i = 1 + (\bar{X} - \bar{x})' S_x^{-1} (x_i - \bar{x})$ be the so-called 
g-weights (Särndal, Swensson and Wretman 1989), and 
$\hat{\beta} = S_x^{-1} S_{xy}$ the least squares estimator of $\beta$. The regression 
estimator of $\bar{Y}$ is 

$$\bar{y}_r = \bar{y} + (\bar{X} - \bar{x})' \hat{\beta}.$$

The $V_s$, $V_d$ and $V_g$ are three estimators of the mean 
squared error of $\bar{y}_r$. The $V_s$ is given by Cochran (1977, 
page 195), 

$$V_s = \frac{1-f}{n(n-p-1)} \sum_{i \in D} \hat{e}_i^2,$$

where $\hat{e}_i = (y_i - \bar{y}) - (x_i - \bar{x})' \hat{\beta}$ and $f = n/N$ is the 
sampling fraction. The $V_d$ is generalized (from $p = 1$ to 
$p > 1$) from one estimator studied by Deng and Wu (1987) 
and it is expected to have a smaller bias than $V_s$ (Silva 
1996), 

A2 
= Ce-. 
4 n(n-1) 2 fal 
where 
Op=(gp =2ef + fA) 
fa—pli-c, -¥y’$(x, —-HMn-DI} 
The $V_g$ is modified from one estimator given by Särndal 
et al. (1989), and it has a similar performance to $V_d$, 
prvi dpa Hert Fi 
é n(n— Dis ) ieD 


The best subset selection strategy (Bs, Bd and Bg) is to 
choose the subset which has the smallest mean squared 
error estimate among all $2^p$ possible subsets. The forward 
selection strategy (Fs, Fd and Fg) starts with the sample 
mean as an estimator, then adds the variable which 
minimizes the mean squared error estimate, and the 
procedure is repeated until the mean squared error estimate 
starts to increase. Refer to Silva and Skinner (1997) for 
details of the implementations of the strategies CN and RI. 
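
The forward selection strategy described above can be sketched in a few lines. The following is only an illustration under stated assumptions, not the implementation used in the paper or in Silva and Skinner (1997): the function names `regression_estimate` and `forward_select` are hypothetical, and the mean squared error estimate is taken to be the Cochran-type estimator $V_s$ with divisor $n(n - p - 1)$.

```python
import numpy as np

def regression_estimate(y, X, X_bar_pop, N):
    """Return (y_r, V_s) for a sample y (n,), X (n, p) and known
    population means X_bar_pop (p,). p may be zero (sample mean)."""
    n, p = X.shape
    y_bar = y.mean()
    if p == 0:
        resid = y - y_bar
        y_r = y_bar
    else:
        x_bar = X.mean(axis=0)
        Xc = X - x_bar
        # Least squares slope on centred data, i.e. S_x^{-1} S_xy
        beta, *_ = np.linalg.lstsq(Xc, y - y_bar, rcond=None)
        resid = y - y_bar - Xc @ beta
        y_r = y_bar + (X_bar_pop - x_bar) @ beta
    f = n / N  # sampling fraction
    v_s = (1 - f) / (n * (n - p - 1)) * np.sum(resid ** 2)
    return y_r, v_s

def forward_select(y, X, X_bar_pop, N):
    """Start from the sample mean and add, one at a time, the variable that
    minimizes V_s; stop as soon as V_s no longer decreases."""
    selected = []
    y_r, v = regression_estimate(y, X[:, :0], X_bar_pop[:0], N)
    remaining = list(range(X.shape[1]))
    while remaining:
        trials = []
        for j in remaining:
            idx = selected + [j]
            trials.append((j,) + regression_estimate(y, X[:, idx], X_bar_pop[idx], N))
        j, cand_y, cand_v = min(trials, key=lambda t: t[2])
        if cand_v >= v:
            break
        selected.append(j)
        remaining.remove(j)
        y_r, v = cand_y, cand_v
    return selected, y_r, v
```

Replacing the body of the loop by an enumeration of all $2^p$ subsets would give the corresponding best subset strategy; swapping in $V_d$ or $V_g$ for $V_s$ would give the other variants.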


3.3 Illustration on One Sample Replicate 


To understand the behavior of $\hat{\bar{Y}}_{BNN}$ in the presence of
outliers and the role played by $\nu$ in robust inference, we
focus on one particular sample. The training data comprises 
the first 100 elements of the population, and the auxiliary 
variables include x,,...,x, and x,, as the first explanatory 
set. Note that the 53rd element has been included in the
training data. 

For BNN models, we set $\lambda = 5$ and $M = 8$, which
produces 62 connections for the full BNN model, and tried
$\nu = 25$, 50, 100, 200 and $+\infty$, where $\nu = +\infty$ is equivalent
to the assumption $\varepsilon_i \sim N(0, \sigma^2)$. For each setting, RJEMC
was run as follows: the network connections were first set to
random numbers drawn from $N(0, 0.01)$, and then
were updated for 1,000 iterations in the parameter space of
the full model, i.e., all indicator variables are set to 1 in
those iterations. After this initialization, 4,000
iterations of RJEMC were run, and 800 samples were
collected from these iterations at the lowest temperature
level at equally spaced intervals. The convergence of RJEMC
can be diagnosed using the Gelman-Rubin statistic $\hat{R}$
(Gelman and Rubin 1992) based on multiple independent
runs. Figure 3 shows the $\hat{R}$ values computed from 10
independent runs. For each sample replicate of the simulation
population, RJEMC converges ($\hat{R} < 1.1$) very fast, usually
within the first 500 iterations (100 BNN samples). We
discarded the first 200 samples as burn-in, and
used the remaining 600 samples for further inference.
For comparison, the linear regression model (1) was also
applied to this sample replicate.
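
For readers who want to reproduce this kind of convergence check, a minimal sketch of the Gelman-Rubin diagnostic for a scalar quantity is given below. It uses the basic potential scale reduction factor (pooled variance estimate over within-chain variance) and omits the degrees-of-freedom correction of Gelman and Rubin (1992); the function name `gelman_rubin` and the array layout are assumptions made for illustration only.

```python
import numpy as np

def gelman_rubin(chains):
    """Basic potential scale reduction factor for a scalar quantity.

    chains: array of shape (m, n), one row per independent run,
            one column per retained sample.
    """
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    B = n * chain_means.var(ddof=1)            # between-chain variance
    var_plus = (n - 1) / n * W + B / n         # pooled variance estimate
    return np.sqrt(var_plus / W)
```

Values close to 1 indicate that the independent runs are sampling from the same distribution; values well above 1 indicate that the runs have not yet mixed.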


Figure 3. Gelman-Rubin statistic $\hat{R}$. The curve was computed
based on 10 independent runs of RJEMC. The random
errors are assumed to be distributed according to $t(100)$.


Figure 4 shows the original data together with the fitted
and predicted values produced by various models. The BNN
results were all obtained in one run of RJEMC. It can be
seen that the linear regression model is not appropriate for
this population, as some fitted and predicted values produced
by the model are negative for this sample replicate. Also, the
fitted response curve (the solid curve in Figures 4(a) and
4(b)) is strongly influenced by the 53rd element and lies
above almost two-thirds of the data points. A similar
phenomenon occurs for the prediction of unsampled values,
see Figures 4(c) and 4(d). As a result, the population mean is
overestimated (Figure 5). Compared with the linear
regression model, the results of the BNN models are less
affected by the 53rd element, especially those computed
with small values of $\nu$. Figure 5 shows that as $\nu$ decreases,
the population mean estimated by the BNN models gets closer
to the true value, and the estimated 95%
confidence interval of the population mean becomes
narrower. This indicates that the influence of the
53rd element on these estimates becomes weaker
as $\nu$ decreases. This is not surprising, as the use of a
heavy-tailed error distribution is known to make the inference
more robust.
Figure 4. Fitted and predicted response curves by various models. The curves are plotted against the proxy variable, and the true response values
are shown by points. (a) The fitted response curves for the sampled elements. (b) The amplification of the square region of (a). (c) The
predicted response curves for the unsampled elements. (d) The amplification of the square region of (c); for clarity, only every
fourth element is plotted in the order of sorted proxy values.

Figure 5. Estimated population mean and the associated 95%
confidence interval by various models. The dotted line shows
the true population mean, which is 194.34.


3.4 Numerical Results on More Sample Replicates 


BNN models were applied to analyze the 1,000 sample
replicates. For each sample replicate of the first explanatory
set, we set $\nu = 100$, $\lambda = 5$ and $M = 8$, which produces 62
connections for the full BNN model. RJEMC was run as
described in section 3.3. In each run 600 BNN samples were
obtained for the inference. The computational results are
summarized in Table 1. It shows that BNN models
make a significant improvement over the linear regression
based models in population mean estimation for the first
explanatory set. Although the BNN estimate is slightly
biased (the relative bias is about 2.5% in absolute
value, which is still acceptable), it has the smallest MSE value
among all estimates in Table 1 and the highest
coverage probability among the estimates with smaller MSE
values (the boldfaced rows). As discussed in the last
subsection, we expect $\hat{\bar{Y}}_{BNN}$ to behave differently for samples
containing and not containing the outlying element 53.
When averaged over only those samples that contain
element 53, $\hat{\bar{Y}}_{BNN}$ with $\nu = 50$ performs very well, with bias
1.51 and 99.6% coverage. The result is obviously not as
good for those samples not containing element 53, due to
the inevitable underestimation of the finite population mean.
Frankly, there is not much one can do if there are outliers in
the population but none in the sample. No statistical method
based on sample information alone will be able to predict
the occurrence of outliers in the non-sample. We believe
that $\hat{\bar{Y}}_{BNN}$ will perform very well for populations without
outliers due to the universal approximation property of
neural networks and the technique of Bayesian model
averaging.
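
The summary statistics reported in the tables can be computed from the replicate-level output along the following lines. This is a sketch under assumptions: the function name `summarize_replicates` is hypothetical, and the coverage is computed from a normal-theory interval (estimate plus or minus 1.96 times the square root of the MSE estimate), which may differ in detail from the interval used in the paper.

```python
import numpy as np

def summarize_replicates(estimates, mse_estimates, true_mean, z=1.96):
    """BIAS, MSE, AVMSE and empirical coverage (%) over sample replicates.

    estimates:     point estimate of the population mean, one per replicate
    mse_estimates: estimated mean squared error, one per replicate
    """
    estimates = np.asarray(estimates, dtype=float)
    mse_estimates = np.asarray(mse_estimates, dtype=float)
    bias = estimates.mean() - true_mean
    mse = np.mean((estimates - true_mean) ** 2)
    avmse = mse_estimates.mean()
    half_width = z * np.sqrt(mse_estimates)
    covered = np.abs(estimates - true_mean) <= half_width
    return {"BIAS": bias, "MSE": mse, "AVMSE": avmse,
            "Coverage": 100.0 * covered.mean()}
```

Applying the same function to subsets of the replicates (for example, groups ordered by the sample mean of a proxy variable) gives conditional versions of these statistics.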

Let x,, denote the average of proxy values of the 
elements in one sample replicate. To see how the perfor- 
mance of the BNN models varied with x,,, we ordered the 
1,000 sample replicates according to their values of x,, and 
divided them into 20 groups of 50 replicates, the first group
containing the 50 replicates whose x,, are smallest, and so 
forth. For each group, we calculated MEAN, MSE and 
AVMSE. Figure 6 shows these conditional values. From 
Figure 6(a) it is easy to see that BNN models possess one 
good property, namely, the population mean estimate is not 
sensitive to the value of x,,. From Figure 6(b) it is easy to 
see that AVMSE provides an essentially unbiased estimate 
for MSE regardless of averaged proxy values. 

To assess the influence of $\nu$, $M$ and $\lambda$ on BNN model size
and prediction ability for the first explanatory set, we
conducted three groups of experiments. In the first group of
experiments, we fixed $M = 8$ and $\lambda = 5$, and varied the value
of $\nu$, $\nu = 50$, 100 and 150. In the second group of
experiments, we fixed $\nu = 100$ and $\lambda = 5$, and varied the value of
$M$, $M = 6$, 8 and 10. In the third group of experiments, we
fixed $\nu = 100$ and $M = 8$, and varied the value of $\lambda$, $\lambda = 4$, 5
and 6. For each setting, RJEMC was run as described in
section 3.3 for the 1,000 sample replicates. The
computational results are summarized in Table 2. It shows that
the averaged model size produced by each setting is about
the same, although it increases slowly as $M$ and $\lambda$ increase.
The results of the first group of experiments show clearly
that for BNN models the choice of $\nu$ involves a trade-off between BIAS and
MSE or AVMSE. The results of
the second and third groups of experiments show that BIAS,
MSE, AVMSE and the coverage probability are rather
stable with respect to $M$ and $\lambda$, although the latter three
statistics tend to increase slowly as $M$ and $\lambda$
increase. The increasing trend of these statistics is due to the
fact that the neural networks tend to be overfitted as $M$ and $\lambda$
increase.


Table 1 
Bias, mean squared error, average of mean squared error estimates 
and empirical coverage of various estimation strategies for the 


population mean using x;,...,x, and x,,; as auxiliary variables. 
Figures other than BNN are reproduced from Silva and Skinner 
(1997). 
Estimation strategy BIAS MSE AVMSE  Coverage* 
(%) 
SM) Sample mean (y, V, ) O58 _ 620109 ~ 1619105 91.8 
CN) Cond. num. red. (¥, V, ) 0.34 507.33 483.63 89.8 
RJ) Ridge PIV UE OY 2a)// (07) 82.5 
Fs) Forward (y,,V,) 040 233-78) 9239.62 82.7 
Fd) Forward (y,,V,) —-1.25 188.08 196.88 82.0 
Fg) Forward (y,,V¢) —1.28 188.38 192.73 81.1 
Bs) Best (¥,.V,) 0.44 236.90 239.49 yen 
Bd) Best (J,.V,) —1.22 190.52 196.84 82.0 
Bg) Best (¥,.V,) 1.24 190.83 192.71 81.1 
FI) Fixed (y,, V,) O29 227: 90" 2424 83.3 
SS) Saturated (y,, V,) OBO W 2335805 9242732 825 
FR) Proc REG (y,,V,;) 0.38 235.86 240.26 82.5 
BNN) 7(100) —4.91 138.11 127.14 84.8 


“Nominal 95% coverage. 
Figure 6. MEAN (panel (a)), MSE and AVMSE (panel (b)) conditional on the averaged proxy values. The 1,000 sample replicates
are ordered on x,, and divided into 20 groups of 50 samples. 


Table 2 
Assessment of the influence of $\nu$, $M$ and $\lambda$ on BNN model size and prediction ability
for the first explanatory set. For convenience of comparison, the results of the setting $\nu = 100$,
$M = 8$ and $\lambda = 5$ are repeated in panels B and C.


Experiment =v M r Size: 
A 50 Swi 5 10:53 

100 SD 10.70 

150 adh ais 10.79 

B 100 Ove ts he 

100 8 ©) 10.70 

100 102" 5 11.83 

c 100 8 4 9.42 

100 rand 10.70 

100 Omg 0 11.83 


1,000 


4Size= Dif] 


’ Nominal 95% coverage. 


The above experiments also address the issue of model
misspecification. Note that the BNN model proposed in this
paper is specified by the three parameters $\nu$, $M$ and $\lambda$. Table
2 shows that the BNN model can still perform well even
when the parameter setting departs somewhat from the
optimal setting. In practice, the setting of $\nu$, $M$ and $\lambda$ can be
determined by a cross-validation experiment. This will be
demonstrated in the second simulation study.

Finally, we consider the weaker set of auxiliary variables 
X15+-+»X;9- For each sample replicate, we set v = 100, A =5 
and M = 8 which produces 107 connections for the full 
BNN model. RJEMC was run as in section 3.3. The 


BIAS MSE AVMSE _ Coverage’ (%) 
698 — 13078 90.08 82.0 
=fO W138i. 12714 84.8 
=3.81 15655" | (60:28 85.5 
34.00 "186.92, 122.58 84.1 
BAO VIC i. 127.14 84.8 
—5.14 914013 132.20 86.4 
-4.94 138.04 125.99 852 
ey. 9 ME ht) Be 84.8 
-4.92 139.62 128.64 Sonn 


ye ,m(A;)/M /1,000, where m(A;) is the number of connections of the neural network A;. 


computational results were summarized in Table 3. It shows 
clearly that BNN models continue to provide a significant 
improvement over the linear regression based models in 
population mean estimation when the strongest predictor 
X,, 1s excluded. The BNN estimate has the smallest MSE 
value among all estimates in Table 3, and has the smallest 
bias and the highest nominal coverage probability among 
the estimates with smaller MSE values (the boldfaced 
rows). 

To assess the influence of $\nu$, $M$ and $\lambda$ on BNN model
size and prediction ability for the second explanatory set,
we conducted the same three groups of experiments as for
the first explanatory set. The computational results are
summarized in Table 4. Panel A shows again the trade-off
between BIAS and MSE or AVMSE made for BNN models
by the value of $\nu$. Panels B and C show that BIAS, MSE,
AVMSE and the coverage probability are even more
stable across different choices of $M$ and $\lambda$ than
for the first explanatory set.


Table 3
Bias, mean squared error, average of mean squared error estimates
and empirical coverage of various estimation strategies for the
population mean using $x_1, \ldots, x_{10}$ as auxiliary variables. Figures
other than BNN are reproduced from Silva and Skinner (1997).

Estimation strategy                    BIAS     MSE    AVMSE   Coverage* (%)
SM) Sample mean                         ...     ...      ...       ...
CN) Cond. num. red.                     ...     ...      ...       ...
RI) Ridge                              1.05  480.18   472.82      89.4
Fs) Forward $(\bar{y}_r, V_s)$         0.06  468.46   397.99      86.7
Fd) Forward $(\bar{y}_r, V_d)$        -8.12  434.27   338.90      81.7
Fg) Forward $(\bar{y}_r, V_g)$        -7.90  433.71   328.46      81.6
Bs) Best $(\bar{y}_r, V_s)$           -0.00  466.16   397.59      86.6
Bd) Best $(\bar{y}_r, V_d)$           -7.90  434.54   336.88      81.5
Bg) Best $(\bar{y}_r, V_g)$           -7.60  433.26   326.05      81.6
FI) Fixed $(\bar{y}_r, V_s)$           0.45  490.49   461.86      89.0
SS) Saturated $(\bar{y}_r, V_s)$      -0.20  462.71   413.17      86.9
FR) Proc REG $(\bar{y}_r, V_s)$         ...  466.18   399.34      86.4
BNN) $t(100)$                         -5.78  395.25   323.12      86.5

* Nominal 95% coverage.


4. SECOND SIMULATION STUDY


In the first simulation study, we show that the BNN
model works well for data sets with outliers. In this
simulation study, we show that the BNN model works even
better for data sets without outliers. In this study, we also
demonstrate how a cross-validation procedure can be
applied to determine a setting for the parameters $\nu$, $M$ and $\lambda$
of the BNN model.

The simulation population comprises the records of the
serious crimes of 141 large Standard Metropolitan Statistical
Areas (SMSAs) in the United States. An SMSA includes a
city (or cities) of specified population size. The data
generally pertain to the years 1976 and 1977, and are
available in Neter, Kutner, Nachtsheim and Wasserman
(1996). We consider the total number of serious crimes in
1977 as the survey variable ($y$) and the following 9 variables
as potential auxiliary variables:

$x_1$ Land area (in square miles);
$x_2$ Estimated 1977 total population (in thousands);
$x_3$ Percent of 1976 SMSA population in central city or
cities;
$x_4$ Percent of 1976 SMSA population 65 years old or
older;
$x_5$ Number of professionally active nonfederal physicians
as of December 31, 1977;
$x_6$ Total number of beds, cribs, and bassinets during 1977;
$x_7$ Percent of adult population (persons 25 years old or
older) who completed 12 or more years of school,
according to the 1970 Census of the Population;
$x_8$ Total number of persons in civilian labor force (persons
16 years old or older classified as employed or
unemployed) in 1977 (in thousands);
$x_9$ Total current income received in 1976 by residents of
the SMSA from all sources (in millions of dollars).
Table 4 


Assessment of the influence of $\nu$, $M$ and $\lambda$ on BNN model size and prediction ability for the
second explanatory set. For convenience of comparison, the results of the setting $\nu = 100$, $M = 8$
and $\lambda = 5$ are repeated in panels B and C of the table.


Experiment =v M r Size“ 
A 50 8 5 14.87 

100 8 5 15.06 

150 8 5 15.17 

B 100 6 5 13.90 

100 8 5 15.06 

100 10 =) 16.05 

C 100 8 > 13.23 

100 8 2 15.06 

100 8 6 16.76 


BIAS MSE AVMSE_ Coverage’ (%) 
-~9.30 394.11 270.09 82.5 
~~ FAS 8 95:25 ph 329.12 86.5 
~4GSe AID AG. 34615 87.1 
~5.77 394.79 319.13 86.0 
ahi gk hat eR. 86.5 
=5.91 ©396.27 ~~ 327.86 a7 
5,62. 39165, 913268 86.4 
5. 18e. Soe Sloe 86.5 
5:78 39645 321.98 86.6 


Oa ws ae ,m(A;)/M /1,000, where m(A;) is the number of connections of the neural network Aj. 


’ Nominal 95% coverage. 
Figure 7. Scatter plots of the response variable y versus the auxiliary variables for the second simulation study.

Table 5
Cross-validation experiments for the SMSA example. For convenience of comparison, the results of the setting
$\nu = 100$, $M = 3$ and $\lambda = 5$ are repeated in panels B and C.
Experiment —-v M 2X Size BIAS (10°) MSE(x10°) AVMSE(x10°) Coverage (%) 
A 50 3 a 10.68 —0.472 4.78 4.19 91 
100 3 5 10.74 —0.527 5.04 4.24 92 
00 3 5 10.74 —0.543 4.76 4.21 92 
B 100 1 5 7.29 —0.466 4.63 3.66 89 
100 2 5 9.42 —0.500 4.61 3.91 90 
100 S 5 10.74 —0.527 5.04 4.24 92 
100 4 5 11.66 —0.480 4.74 4.47 91 
ec 100 3 4 9.56 —0.434 4.68 4.12 92 
100 3 5 10.74 —0.527 5.04 4.24 92 
100 3 6 11.82 —0.455 4.66 4.28 08 


“Nominal 95% coverage. 


Figure 7, the scatter plot of y versus the 9 auxiliary 
variables, suggests that a linear regression model may not be 
appropriate for the data set. There is a strong nonlinear 
relationship between y and x,,x,,x, and x,. Also, the 
explanatory variables x,,x5,X%,,X, and x, are highly 
correlated. First, we demonstrate how a cross-validation
procedure can be applied to determine the setting for the
parameters $\nu$, $M$ and $\lambda$ of the BNN model. We treated the
first 70 records as a small finite population, generated 100
sample replicates of size 50 from these 70 records by
simple random sampling without replacement,
and then conducted the following experiments. In the first
group of experiments, we fixed $M = 3$ and $\lambda = 5$, and varied
the value of $\nu$, $\nu = 50$, 100 and $\infty$, where $\nu = \infty$ simply
indicates the normality assumption for the
disturbance. Note that $M = 3$ results in a full model of 43
connections, which is large enough for this data set.
In the second group of experiments, we fixed $\nu = 100$ and
$\lambda = 5$, and varied the value of $M$, $M = 1$, 2, 3, 4. In the third
group of experiments, we fixed $\nu = 100$ and $M = 3$, and
varied the value of $\lambda$, $\lambda = 4$, 5, 6. For each setting, RJEMC
was run as in the first simulation study. The computational
results are summarized in Table 5. It shows that the
performance of the BNN model is rather stable across
the settings. It also suggests that the setting
$\nu = 100$, $M = 3$ and $\lambda = 4$ is probably a good choice for this
simulation population when the values of BIAS, MSE,
AVMSE and coverage probability are considered together.
In the further analysis, we generated 500 sample
replicates of size 70 from all the 141 records by
simple random sampling without replacement. For each
replicate, RJEMC was run as in the first simulation study.
The computational results are summarized in Table 6. It
shows that the BNN model also works well for this
population. We also tried the other settings given in Table 5
for the 500 sample replicates. The computational results are
all similar.


Table 6 
Computational results for the second simulation study with
$\nu = 100$, $M = 3$ and $\lambda = 4$


Size BIAS MSE AVMSE _ Coverage“ 
(x10°) (x10°) (x10°) (%) 
9.20 —0.512 3.36 3.25 92.6 


“Nominal 95% coverage. 


5. DISCUSSION 


In this article, we studied the use of Bayesian neural
networks in finite population estimation. The numerical
results show that BNN models make a significant improvement
over the linear regression based methods. The improvement
comes mainly from the BNN models themselves, not from
Bayesian model averaging. We also applied the linear regression based
Bayesian model averaging method (Liang, Truong and
Wong 2001) to the same problem, and the improvement
over Silva and Skinner (1997) is only marginal. Although
our implementation of BNN models is not specific to finite
populations, we do not think this is a shortcoming of our
method. The generality of our method suggests wide
applications, for example, in nonlinear regression and
nonlinear time series (the program is available on request
from the first author). Of course, further research on how
to use the known auxiliary variable information for a finite
population in BNN training is also of interest.


APPENDIX 


Before proving Theorem 2.1, we give one formula which 
will be used in the proof. 


Formula 5.1 (Laplace’s method)

$$\int b(\theta) \exp\{-n h(\theta)\}\, d\theta = (2\pi/n)^{p/2}\, |\Sigma|^{1/2} \exp\{-n h(\hat{\theta})\}\, b(\hat{\theta})\, \{1 + O(n^{-1})\}, \qquad (24)$$

as $n \to \infty$, where $b(\cdot)$ is a general function which does not
depend on $n$, $h(\theta)$ is a constant-order function of $n$ as
$n \to \infty$, $p$ is the dimension of $\theta$, $\hat{\theta}$ is the maximizer of
$-h(\theta)$ and $\Sigma = (D^2 h(\hat{\theta}))^{-1}$ is the inverse of the negative
Hessian matrix of $-h$ evaluated at $\hat{\theta}$.

For the general formulation of Laplace’s method, see Kass
and Vaidyanathan (1992).
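
As a quick numerical sanity check of the $O(n^{-1})$ accuracy, the following sketch compares Laplace's approximation with a case where the integral is known in closed form. The example is chosen here for illustration and does not appear in the paper: with $b(\theta) = 1$ and $h(\theta) = \theta - \log\theta$ on $(0, \infty)$, the exact value is $\Gamma(n+1)/n^{n+1}$, the maximizer of $-h$ is $\hat{\theta} = 1$ with $h(\hat{\theta}) = 1$ and $D^2 h(\hat{\theta}) = 1$, so the approximation is $(2\pi/n)^{1/2} e^{-n}$.

```python
import math

def laplace_ratio(n):
    """Ratio exact / Laplace approximation for the integral of
    exp{-n (theta - log theta)} over (0, inf)."""
    # Exact value: Gamma(n + 1) / n^(n + 1), computed on the log scale
    log_exact = math.lgamma(n + 1) - (n + 1) * math.log(n)
    # Laplace approximation with p = 1, Sigma = 1, h(theta_hat) = 1, b = 1
    log_laplace = 0.5 * math.log(2 * math.pi / n) - n
    return math.exp(log_exact - log_laplace)

for n in (10, 100, 1000):
    print(n, laplace_ratio(n), 1 + 1 / (12 * n))
```

The ratio behaves like $1 + 1/(12n)$, which is the first correction term of Stirling's series and is consistent with the $\{1 + O(n^{-1})\}$ factor in the formula.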


Proof of Theorem 2.1 


Proof: Part (a). By definition of expectation,
$E_\pi |g(x_0, \theta_\Lambda)|^{2+\delta}$ can be written as

$$E_\pi |g(x_0, \theta_\Lambda)|^{2+\delta} = \sum_{k=0}^{K} P(\Lambda_k | D) \int |g(x_0, \theta_\Lambda)|^{2+\delta}\, \pi(\theta_\Lambda | \Lambda_k, D)\, d\theta_\Lambda.$$

Following from the normality of the posterior distributions
$\pi(\theta_\Lambda | \Lambda_k, D)$ (Walker 1969) and the fact that the activation
function $\psi(\cdot)$ in (3) is bounded, we know (9) holds. Walker
(1969) showed that the posterior distribution is Gaussian in
the limit of infinite training data.


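Part (b). For a given observation $x_0$, $E_\pi g(x_0, \theta_\Lambda)$ can be
written as

$$E_\pi g(x_0, \theta_\Lambda) = \frac{\sum_{\Lambda \in \Omega} P(\Lambda) \int g(x_0, \theta_\Lambda) \exp\{-n h(\theta_\Lambda)\}\, \pi(\theta_\Lambda | \Lambda)\, d\theta_\Lambda}{\sum_{\Lambda \in \Omega} P(\Lambda) \int \exp\{-n h(\theta_\Lambda)\}\, \pi(\theta_\Lambda | \Lambda)\, d\theta_\Lambda}, \qquad (25)$$
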

where 


= 2 ie 2 a? 
log (8, |A) =—logo oe logan 


i=0 


~ = log(2n) + mlogi —log(m!), (26) 


tlt Feed gee bpm f (x,)) 
h(®,)=— BiveD a oo t+ Le 


Fe Bioade 2 eee ign) 


n P pero Vo 
Ms 


v+l nx 
go + EW, - &(x,,0,))° 


lECy, — g(x,))? + (e(x,)-8(%,,0,)} 27) 
where the first approximation follows from the Taylor
expansion, $\log(1 + z) \approx z$, when $z$ lies in a neighbourhood
of zero; and the second approximation follows from the
weak law of large numbers by assuming that $n$ is large. Note that
$\nu$ is often set to a large number, say, a number greater than
30. In the first example of this paper, we set $\nu = 100$.
Equation (27) implies that the minimum of $h(\theta_\Lambda)$ is
attained when $g(x_i) = g(x_i, \theta_\Lambda)$ holds, that is,
$g(x_0, \hat{\theta}_\Lambda) = g(x_0)$, where $\hat{\theta}_\Lambda = \arg\min_{\theta_\Lambda} h(\theta_\Lambda)$.

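By applying Laplace’s method to the numerator of (25)
with $b(\cdot) = g(x_0, \theta_\Lambda)\, \pi(\theta_\Lambda | \Lambda)$, we have

$$\sum_{\Lambda \in \Omega} P(\Lambda) \int g(x_0, \theta_\Lambda) \exp\{-n h(\theta_\Lambda)\}\, \pi(\theta_\Lambda | \Lambda)\, d\theta_\Lambda$$
$$\approx \sum_{\Lambda \in \Omega} P(\Lambda) (2\pi/n)^{p/2} |\Sigma_\Lambda|^{1/2} \exp\{-n h(\hat{\theta}_\Lambda)\}\, g(x_0, \hat{\theta}_\Lambda)\, \pi(\hat{\theta}_\Lambda | \Lambda)$$
$$\approx g(x_0) \sum_{\Lambda \in \Omega} P(\Lambda) (2\pi/n)^{p/2} |\Sigma_\Lambda|^{1/2} \exp\{-n h(\hat{\theta}_\Lambda)\}\, \pi(\hat{\theta}_\Lambda | \Lambda), \qquad (28)$$

where the first approximation follows from the Laplace
formula (24), and the second approximation follows from
the equality $g(x_0, \hat{\theta}_\Lambda) = g(x_0)$. Here we assume that the
number of hidden units of each $\Lambda$ is sufficiently large such
that $g(\cdot)$ can be approximated arbitrarily well by the
network with properly adjusted weights. Otherwise, that
term will take a small value and is negligible in the last
approximation of (28).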

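Similarly, by applying Laplace’s method to the
denominator of (25) with $b(\cdot) = \pi(\theta_\Lambda | \Lambda)$, we have

$$\sum_{\Lambda \in \Omega} P(\Lambda) \int \exp\{-n h(\theta_\Lambda)\}\, \pi(\theta_\Lambda | \Lambda)\, d\theta_\Lambda \approx \sum_{\Lambda \in \Omega} P(\Lambda) (2\pi/n)^{p/2} |\Sigma_\Lambda|^{1/2} \exp\{-n h(\hat{\theta}_\Lambda)\}\, \pi(\hat{\theta}_\Lambda | \Lambda). \qquad (29)$$
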


Following from (28), (29), and the approximation accuracy
($O(n^{-1})$) of Laplace’s method, we have


Poa 0 See (30) 


as $n \to \infty$. Following from (7), (9) and (30), we have


M 


il A 
— Vee Oya) 9.2 Ke ly Ces 
of 2 8% 028, > 8%) 


as $n \to \infty$ and $M \to \infty$.


Part (c). It follows from (8), (9), (30) and Slutsky’s 
Theorem (Casella and Berger 2002). The proof is 
completed. 
Simultaneous Use of Multiple Imputation for Missing Data and
Disclosure Limitation 


JEROME P. REITER ! 


ABSTRACT 


Several statistical agencies use, or are considering the use of, multiple imputation to limit the risk of disclosing respondents’ 
identities or sensitive attributes in public use data files. For example, agencies can release partially synthetic datasets, 
comprising the units originally surveyed with some collected values, such as sensitive values at high risk of disclosure or 
values of key identifiers, replaced with multiple imputations. This article presents an approach for generating multiply- 
imputed, partially synthetic datasets that simultaneously handles disclosure limitation and missing data. The basic idea is to 
fill in the missing data first to generate m completed datasets, then replace sensitive or identifying values in each completed 
dataset with r imputed values. This article also develops methods for obtaining valid inferences from such multiply-imputed 
datasets. New rules for combining the multiple point and variance estimates are needed because the double duty of multiple 
imputation introduces two sources of variability into point estimates, which existing methods for obtaining inferences from 
multiply-imputed datasets do not measure accurately. A reference t-distribution appropriate for inferences when m and r are 
moderate is derived using moment matching and Taylor series approximations. 


KEY WORDS: Confidentiality; Missing data; Public use data; Survey; Synthetic data. 


1. INTRODUCTION 


Many statistical agencies disseminate microdata, i.e., 
data on individual units, in public use files. These agencies 
strive to release files that are (i) safe from attacks by ill-
intentioned data users seeking to learn respondents’
identities or attributes, (ii) informative for a wide range of
statistical analyses, and (iii) easy for users to analyze with
standard statistical methods. Doing this well is a difficult 
task. The proliferation of publicly available databases, and 
improvements in record linkage technologies, have made 
disclosures a serious threat, to the point where most 
statistical agencies alter microdata before release. For 
example, agencies globally recode variables, such as 
releasing ages in five year intervals or top-coding incomes 
above $100,000 as “$100,000 or more” (Willenborg and de 
Waal 2001); they swap data values for randomly selected 
units (Dalenius and Reiss 1982); or, they add random noise 
to continuous data values (Fuller 1993). Inevitably, these 
strategies reduce the utility of the released data, making 
some analyses impossible and distorting the results of 
others. They also complicate analyses for users. To analyze 
properly perturbed data, users should apply the likelihood- 
based methods described by Little (1993) or the mea- 
surement error models described by Fuller (1993). These are 
difficult to use for non-standard estimands and may require 
analysts to learn new statistical methods and specialized 
software programs. 

An alternative approach to disseminating public use data 
was suggested by Rubin (1993): release multiply-imputed, 


synthetic datasets. Specifically, he proposed that agencies (i) 
randomly and independently sample units from the 
sampling frame to comprise each synthetic data set, (ii) 
impute unknown data values for units in the synthetic 
samples using models fit with the original survey data, and 
(iii) release multiple versions of these datasets to the public. 
These are called fully synthetic data sets. Releasing fully 
synthetic data can protect confidentiality, since iden- 
tification of units and their sensitive data is nearly 
impossible when the values in the released data are not 
actual, collected values. Furthermore, with appropriate 
synthetic data generation and the inferential methods 
developed by Raghunathan, Reiter and Rubin (2003) and 
Reiter (2004b), it can allow data users to make valid 
inferences for a variety of estimands using standard, 
complete-data statistical methods and software. Other 
attractive features of fully synthetic data are described by 
Rubin (1993), Little (1993), Fienberg, Makov and Steele 
(1998), Raghunathan et al. (2003), and Reiter (2002, 
2004a). 

No statistical agencies have released fully synthetic 
datasets as of this writing, but some have adopted a variant 
of the multiple imputation approach suggested by Little 
(1993): release datasets comprising the units originally 
surveyed with some collected values, such as sensitive 
values at high risk of disclosure or values of key identifiers, 
replaced with multiple imputations. These are called 
partially synthetic datasets. For example, the U.S. Federal 
Reserve Board protects data in the U.S. Survey of 
Consumer Finances by replacing monetary values at high 
disclosure risk with multiple imputations, releasing a
mixture of these imputed values and the unreplaced, 
collected values (Kennickell 1997). The U.S. Bureau of the 
Census and Abowd and Woodcock (2001) protect data in 
longitudinal, linked data sets by replacing all values of some 
sensitive variables with multiple imputations and leaving 
other variables at their actual values. Liu and Little (2002) 
present a general algorithm, named SMIKe, for simulating 
multiple values of key identifiers for selected units. 

All these partially synthetic approaches are appealing
because they promise to maintain the primary benefits of
fully synthetic data (protecting confidentiality while
allowing users to make inferences without learning
complicated statistical methods or software) with decreased
sensitivity to the specification of imputation models (Reiter
2003). Valid inferences from partially synthetic datasets can
be obtained using the methods developed by Reiter (2003, 
2004b), whose rules for combining point and variance 
estimates again differ from those of Rubin (1987) and also 
from those of Raghunathan et al. (2003). 

The existing theory and methods for partially synthetic 
data do not deal explicitly with an important practical 
complication: in most large surveys, there are units that fail 
to respond to some or all items of the survey. This article 
presents a multiple imputation approach that handles 
simultaneously missing data and disclosure limitation. The 
approach involves two steps. First, the agency uses multiple 
imputation to fill in the missing data, generating m multiply- 
imputed datasets. Second, the agency replaces the values at 
risk of disclosure in each imputed dataset with r multiple 
imputations, ultimately releasing mr multiply-imputed 
datasets. This double-duty of multiple imputation requires 
new methods for obtaining valid inferences from the 
multiply-imputed datasets, which are derived here. 
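
The nesting implied by the two steps can be sketched as follows. This is a toy illustration only: the function name `two_stage_release` is hypothetical, and random draws from the available values stand in for draws from proper posterior predictive imputation models, which an agency would use in practice.

```python
import random

def two_stage_release(values, at_risk, m, r, seed=0):
    """Return m nests of r datasets each (mr datasets in total).

    values:  list of numbers, with None marking missing items
    at_risk: set of positions whose values must be replaced before release
    """
    rng = random.Random(seed)
    observed = [v for v in values if v is not None]
    released = []
    for _ in range(m):
        # Step 1: fill in the missing data once per nest (placeholder draw)
        completed = [v if v is not None else rng.choice(observed) for v in values]
        nest = []
        for _ in range(r):
            # Step 2: replace at-risk values in the completed dataset
            synthetic = [rng.choice(completed) if i in at_risk else v
                         for i, v in enumerate(completed)]
            nest.append(synthetic)
        released.append(nest)
    return released
```

Datasets in the same nest share the same imputed missing values and differ only in the replaced values, which is why two sources of variability have to be tracked when combining estimates.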

The paper is organized as follows. Section 2 reviews 
multiple imputation for missing and partially synthetic data. 
Section 3 presents the new methods for generating partially 
synthetic data and obtaining valid inferences when some 
survey data are missing. Section 4 shows a derivation of 
these methods from a Bayesian perspective, and it discusses 
conditions under which the resulting inferences should be 
valid from a frequentist perspective. Section 5 concludes 
with a discussion of the challenges to implementing this 
multiple imputation approach on genuine data, with an aim 
towards stimulating future research. 


2. REVIEW OF MULTIPLE IMPUTATION 
INFERENCES 


To describe multiple imputation, we use the notation of
Rubin (1987). For a finite population of size $N$, let $I_j = 1$ if
unit $j$ is selected in the survey, and $I_j = 0$ otherwise, where
$j = 1, \ldots, N$. Let $I = (I_1, \ldots, I_N)$. Let $R_j$ be a $p \times 1$ vector
of response indicators, where $R_{jk} = 1$ if the response for
unit $j$ to survey item $k$ is recorded, and $R_{jk} = 0$ otherwise.
Let $R = (R_1, \ldots, R_N)$. Let $Y$ be the $N \times p$ matrix of survey
data for all units in the population. Let $Y_{inc} = (Y_{obs}, Y_{mis})$ be
the $n \times p$ matrix of survey data for the $n$ units with $I_j = 1$;
$Y_{obs}$ is the portion of $Y_{inc}$ that is observed, and $Y_{mis}$ is the
portion of $Y_{inc}$ that is missing due to nonresponse. Let $X$ be
the $N \times d$ matrix of design variables for all $N$ units in the
population, e.g., stratum or cluster indicators or size
measures. We assume that such design information is
known approximately for all population units, for example
from census records or the sampling frame(s). Finally, we
write the observed data as $D = (X, Y_{obs}, I, R)$.


2.1 Multiple Imputation for Missing Data 


The agency fills in values for Y,,,. with draws from the 
Bayesian posterior predictive distribution of (Y,,,. |D), or 
approximations of that distribution such as those of 
Raghunathan, Lepkowski, Van Hoewyk and Solenberger 
(2001). These draws are repeated independently / = 1,...,m 
times to obtain m completed data sets, D“ =(D,Y“?). 
Multiple rather than single imputations are used so that 
analysts can estimate the variability due to imputing missing 
data. 

In each imputed data set $D^{(l)}$, the analyst estimates the 
population quantity of interest, $Q$, using some estimator $q$, 
and estimates the variance of $q$ with some estimator $u$. We 
assume that the analyst specifies $q$ and $u$ by acting as if each 
$D^{(l)}$ was in fact collected data from a random sample of 
$(X, Y)$ based on the original sampling design $I$, i.e., $q$ and $u$ 
are complete-data estimators. 

For $l = 1, \ldots, m$, let $q^{(l)}$ and $u^{(l)}$ be respectively the 
values of $q$ and $u$ in data set $D^{(l)}$. Under assumptions 
described in Rubin (1987), the analyst can obtain valid 
inferences for scalar $Q$ by combining the $q^{(l)}$ and $u^{(l)}$. 
Specifically, the following quantities are needed for 
inferences: 


$$\bar{q}_m = \sum_{l=1}^{m} q^{(l)} / m \qquad (1)$$

$$b_m = \sum_{l=1}^{m} (q^{(l)} - \bar{q}_m)^2 / (m-1) \qquad (2)$$

$$\bar{u}_m = \sum_{l=1}^{m} u^{(l)} / m. \qquad (3)$$


The analyst then can use $\bar{q}_m$ to estimate $Q$ and 
$T_m = (1 + 1/m) b_m + \bar{u}_m$ to estimate the variance of $\bar{q}_m$. 
Inferences can be based on $t$-distributions with degrees of 
freedom $\nu_m = (m-1)(1 + \bar{u}_m / ((1 + 1/m) b_m))^2$. 
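
As a minimal sketch of how an analyst might apply Equations (1) to (3), the Python function below computes $\bar{q}_m$, $T_m$, and $\nu_m$ from lists of point estimates and variance estimates. The function name and the example numbers are hypothetical and are not taken from the paper; the sketch assumes the between-imputation variance $b_m$ is nonzero.

```python
from statistics import mean, variance


def combine_missing_data(q, u):
    """Combine m point estimates q[l] and variance estimates u[l].

    Illustrative helper (hypothetical name) for the missing-data
    combining rules; assumes len(q) == len(u) >= 2 and that the
    point estimates are not all identical.
    """
    m = len(q)
    q_bar = mean(q)                 # Equation (1)
    b = variance(q)                 # Equation (2), divisor m - 1
    u_bar = mean(u)                 # Equation (3)
    t = (1 + 1 / m) * b + u_bar     # T_m
    df = (m - 1) * (1 + u_bar / ((1 + 1 / m) * b)) ** 2  # nu_m
    return q_bar, t, df


# Made-up estimates from m = 5 completed datasets.
q_bar, t_m, nu_m = combine_missing_data(
    [10.0, 12.0, 11.0, 13.0, 14.0],
    [4.0, 4.0, 4.0, 4.0, 4.0],
)
```

With these made-up inputs, $b_m = 2.5$ and $\bar{u}_m = 4$, so $T_m = 1.2 \times 2.5 + 4 = 7$.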


2.2 Multiple Imputation for Partially Synthetic Data 
when $Y_{inc} = Y_{obs}$ 


Assuming no missing data, i.e., $Y_{inc} = Y_{obs}$, the agency 
constructs partially synthetic datasets by replacing selected 
values from the observed data with imputations. Let $Z_j = 1$ 
if unit $j$ is selected to have any of its observed data replaced 
with synthetic values, and let $Z_j = 0$ for those units with all 
data left unchanged. Let $Z = (Z_1, \ldots, Z_n)$. Let $Y_{rep,i}$ be all 
the imputed (replaced) values in the $i$th synthetic data set, 
and let $Y_{nrep}$ be all unchanged (unreplaced) values of $Y_{obs}$. 
The $Y_{rep,i}$ are assumed to be generated from the posterior 
predictive distribution of $(Y_{rep,i} \mid D, Z)$, or a close 
approximation of it. The values in $Y_{nrep}$ are the same in all 
synthetic data sets. Each synthetic data set, $d_i$, then 
comprises $(X, Y_{rep,i}, Y_{nrep}, I, Z)$. Imputations are made 
independently $i = 1, \ldots, r$ times to yield $r$ different partially 
synthetic data sets, which are released to the public. Once 
again, multiple imputations enable analysts to account for 
variability due to imputation. 

The values in Z can and frequently will depend on the 
values in D. For example, the agency may simulate sensitive 
variables or identifiers only for units in the sample with rare 
combinations of identifiers; or, the imputer may replace 
only incomes above $100,000 with imputed values. To 
avoid bias, the imputations should be drawn from the 
posterior predictive distribution of Y for those units with 
$Z_j = 1$. Reiter (2003) illustrates the problems that can arise 
when imputations are not conditional on Z. 

Inferences from partially synthetic datasets are based on 
quantities defined in Equations (1)-(3), with $r$ in place of 
$m$. As shown by Reiter (2003), under certain conditions the 
analyst can use $\bar{q}_r$ to estimate $Q$ and $T_p = b_r / r + \bar{u}_r$ to 
estimate the variance of $\bar{q}_r$. Inferences for scalar $Q$ can be 
based on $t$-distributions with degrees of freedom 
$\nu_p = (r-1)(1 + \bar{u}_r / (b_r / r))^2$. 
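
The partially synthetic rules differ from the missing-data rules only in the variance and degrees of freedom. The sketch below, with a hypothetical function name and made-up inputs, computes $T_p = b_r / r + \bar{u}_r$ and $\nu_p = (r-1)(1 + \bar{u}_r / (b_r / r))^2$; it assumes $b_r$ is nonzero.

```python
from statistics import mean, variance


def combine_partially_synthetic(q, u):
    """Combine estimates from r partially synthetic datasets.

    Illustrative helper (hypothetical name); q[i] and u[i] are the
    point and variance estimates from synthetic dataset d_i.
    """
    r = len(q)
    q_bar = mean(q)            # Equation (1) with r in place of m
    b = variance(q)            # Equation (2) with r in place of m
    u_bar = mean(u)            # Equation (3) with r in place of m
    t_p = b / r + u_bar        # variance estimate T_p
    nu_p = (r - 1) * (1 + u_bar / (b / r)) ** 2
    return q_bar, t_p, nu_p


# Made-up estimates from r = 5 synthetic datasets.
q_bar_r, t_p, nu_p = combine_partially_synthetic(
    [10.0, 12.0, 11.0, 13.0, 14.0],
    [4.0, 4.0, 4.0, 4.0, 4.0],
)
```

For the same inputs, the missing-data rule would give $(1 + 1/5) \times 2.5 + 4 = 7$, whereas $T_p = 2.5/5 + 4 = 4.5$, which shows how much the two rules can differ.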


3. PARTIALLY SYNTHETIC DATA 
WHEN $Y_{inc} \neq Y_{obs}$ 


When some data are missing, it seems logical to impute 
the missing and partially synthetic data simultaneously. 
However, imputing $Y_{mis}$ and $Y_{rep}$ from the same posterior 
predictive distribution can result in improper imputations. 
For an illustrative example, suppose univariate data from a 
normal distribution have some values missing completely at 
random (Rubin 1976). Further, suppose the agency seeks to 
replace all values larger than some threshold with 
imputations. The imputations for missing data can be based 
on a normal distribution fit using all of $Y_{obs}$. However, the 
imputations for replacements must be based on a posterior 
distribution that conditions on values being larger than the 
threshold. Drawing $Y_{mis}$ and $Y_{rep}$ from the same 
distribution will result in biased inferences. 
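
The threshold example can be checked with a small simulation. The sketch below is illustrative only: it uses plug-in estimates of the normal mean and standard deviation rather than posterior draws, and the seed, sample size, and threshold are arbitrary assumptions. It replaces every value above the threshold either with a draw from the fitted unconditional normal or with a draw from the fitted normal restricted to values above the threshold.

```python
import random
from statistics import mean, stdev


def draw_above(rng, mu, sigma, threshold):
    """Rejection-sample one value from N(mu, sigma^2) given value > threshold."""
    while True:
        y = rng.gauss(mu, sigma)
        if y > threshold:
            return y


rng = random.Random(2004)      # arbitrary seed for reproducibility
threshold = 1.0                # arbitrary replacement threshold
y_obs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
mu_hat, sigma_hat = mean(y_obs), stdev(y_obs)

# Replacements drawn from the same (unconditional) fitted normal.
same_dist = [rng.gauss(mu_hat, sigma_hat) if y > threshold else y
             for y in y_obs]
# Replacements drawn conditional on exceeding the threshold.
conditional = [draw_above(rng, mu_hat, sigma_hat, threshold)
               if y > threshold else y for y in y_obs]

bias_same = mean(same_dist) - mean(y_obs)
bias_conditional = mean(conditional) - mean(y_obs)
```

In this setup the unconditional replacements pull the sample mean down noticeably, while the conditional replacements leave it close to the original value.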

Imputing the $Y_{mis}$ and $Y_{rep}$ separately generates two 
sources of variability, in addition to the sampling variability 
in $D$, that the user must account for to obtain valid 
inferences. Neither $T_m$ nor $T_p$ correctly estimates the total 
variation introduced by the dual use of multiple imputation. 
The bias of each can be illustrated with two simple 
examples. Suppose only one value needs replacement, but 
there are hundreds of missing values to be imputed. 
Intuitively, the variance of the point estimator of $Q$ should 
be well approximated by $T_m$, and $T_p$ should underestimate 
the variance, as it is missing a $b_m$. On the other hand, 
suppose only one value is missing, but there are hundreds of 
values to be replaced. The variance should be well 
approximated by $T_p$, and $T_m$ should overestimate the 
variance, as it includes an extra $b_m$. 

To allow users to estimate the total variability correctly, 
agencies can employ a three-step procedure for generating 
imputations. First, the agency fills in $Y_{mis}$ with draws from 
the posterior distribution for $(Y_{mis} \mid D)$, resulting in $m$ 
completed datasets, $D^{(1)}, \ldots, D^{(m)}$. Second, in each $D^{(l)}$, the 
agency selects the units whose values are to be replaced, i.e., 
whose $Z_j^{(l)} = 1$. In many cases, the agency will impute 
values for the same units in all $D^{(l)}$ to avoid releasing any 
genuine, sensitive values for the selected units. We assume 
this is the case throughout and therefore drop the superscript 
$l$ from $Z$. Third, in each $D^{(l)}$, the agency imputes values 
$Y_{rep,i}^{(l)}$ for those units with $Z_j = 1$, using the posterior 
distribution for $(Y_{rep} \mid D^{(l)}, Z)$. This is repeated 
independently $i = 1, \ldots, r$ times for $l = 1, \ldots, m$, so that a 
total of $M = mr$ datasets are generated. Each dataset, 
$d_i^{(l)} = (X, Y_{nrep}, Y_{mis}^{(l)}, Y_{rep,i}^{(l)}, I, R, Z)$, includes a label 
indicating the $l$ of the $D^{(l)}$ from which it was drawn. These 
M datasets are released to the public. Releasing such nested, 
multiply-imputed datasets also has been proposed for 
handling missing data outside of the disclosure limitation 
context (Shen 2000; Rubin 2003). 
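
The three-step procedure can be sketched for the univariate threshold example. Everything below is a toy illustration under stated assumptions: missing values are marked with `None`, imputations use plug-in normal fits rather than true posterior predictive draws, units are flagged for replacement when their observed value exceeds a threshold, and the function name and data are hypothetical.

```python
import random
from statistics import mean, stdev


def nested_synthesis(y, threshold, m, r, seed=0):
    """Toy nested imputation: returns (datasets, z), datasets[l][i] ~ d_i^(l)."""
    rng = random.Random(seed)
    observed = [v for v in y if v is not None]
    mu_obs, sd_obs = mean(observed), stdev(observed)
    # Step 2 (done once): Z_j = 1 for observed values above the threshold,
    # so the same units are replaced in every completed dataset.
    z = [v is not None and v > threshold for v in y]
    datasets = []
    for _ in range(m):
        # Step 1: fill in missing values to form a completed dataset D^(l).
        completed = [rng.gauss(mu_obs, sd_obs) if v is None else v for v in y]
        mu_l, sd_l = mean(completed), stdev(completed)
        group = []
        for _ in range(r):
            # Step 3: replace flagged values with draws conditional on
            # exceeding the threshold (simple rejection sampling).
            synthetic = list(completed)
            for j, flagged in enumerate(z):
                if flagged:
                    draw = rng.gauss(mu_l, sd_l)
                    while draw <= threshold:
                        draw = rng.gauss(mu_l, sd_l)
                    synthetic[j] = draw
            group.append(synthetic)
        datasets.append(group)
    return datasets, z


y = [0.2, None, 1.7, -0.4, None, 2.3, 0.9, -1.1]
datasets, z = nested_synthesis(y, threshold=1.5, m=3, r=2, seed=1)
```

The outer list index plays the role of the label $l$ that accompanies each released dataset: the $r$ datasets in `datasets[l]` share the same imputed missing values and differ only in the replacement draws.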

Analysts can obtain valid inferences from these released 
datasets by combining inferences from the individual 
datasets. As before, let $q$ be the analyst’s estimator of $Q$, and 
let $u$ be the analyst’s estimator of the variance of $q$. We 
assume the analyst specifies $q$ and $u$ by acting as if each 
$d_i^{(l)}$ was in fact collected data from a random sample of 
$(X, Y)$ based on the original sampling design $I$. For 
$l = 1, \ldots, m$ and $i = 1, \ldots, r$, let $q_i^{(l)}$ and $u_i^{(l)}$ be respectively 
the values of $q$ and $u$ in data set $d_i^{(l)}$. The following 
quantities are needed for inferences about scalar $Q$: 


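$$\bar{q}_M = \sum_{l=1}^{m} \sum_{i=1}^{r} q_i^{(l)} / (mr) = \sum_{l=1}^{m} \bar{q}^{(l)} / m \qquad (4)$$

$$\bar{b}_M = \sum_{l=1}^{m} \sum_{i=1}^{r} (q_i^{(l)} - \bar{q}^{(l)})^2 / (m(r-1)) = \sum_{l=1}^{m} b^{(l)} / m \qquad (5)$$

$$B_M = \sum_{l=1}^{m} (\bar{q}^{(l)} - \bar{q}_M)^2 / (m-1) \qquad (6)$$

$$\bar{u}_M = \sum_{l=1}^{m} \sum_{i=1}^{r} u_i^{(l)} / (mr). \qquad (7)$$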


The $\bar{q}^{(l)}$ is the average of the point estimates in each 
group of datasets indexed by $l$, and the $\bar{q}_M$ is the average of 
these averages across $l$. The $b^{(l)}$ is the variance of the point 
estimates for each group of datasets indexed by $l$, and the 
$\bar{b}_M$ is the average of these variances. The $B_M$ is the variance of 
the $\bar{q}^{(l)}$ across synthetic datasets. The $\bar{u}_M$ is the average of 
the estimated variances of $q$ across all synthetic datasets. 

Under conditions described in section 4, the analyst can 
use $\bar{q}_M$ to estimate $Q$. An estimate of the variance of $\bar{q}_M$ 
is: 


$$T_M = (1 + 1/m) B_M - \bar{b}_M / r + \bar{u}_M. \qquad (8)$$


When $n$, $m$, and $r$ are large, inferences can be based on 
the normal distribution, $(Q - \bar{q}_M) \sim N(0, T_M)$. When $m$ 
and $r$ are moderate, inferences can be based on the 
$t$-distribution, $(Q - \bar{q}_M) \sim t_{\nu_M}(0, T_M)$, with degrees of 
freedom 


$$\nu_M = \left( \frac{((1 + 1/m) B_M)^2}{(m-1) T_M^2} + \frac{(\bar{b}_M / r)^2}{m(r-1) T_M^2} \right)^{-1}. \qquad (9)$$
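
To make the nested combining rules concrete, the sketch below computes the quantities in Equations (4) to (9) from nested lists of estimates. The function name and the example numbers are hypothetical; the sketch assumes $m \geq 2$, $r \geq 2$, and equal group sizes, and it does not guard against a negative value of $T_M$, which the subtraction of $\bar{b}_M / r$ makes algebraically possible.

```python
from statistics import mean, variance


def combine_nested(q, u):
    """Combine estimates from the M = m * r nested datasets.

    q[l][i] and u[l][i] are the point and variance estimates from
    dataset d_i^(l). Illustrative helper with a hypothetical name.
    """
    m, r = len(q), len(q[0])
    group_means = [mean(row) for row in q]           # q-bar^(l)
    q_bar = mean(group_means)                        # Equation (4)
    b_bar = mean([variance(row) for row in q])       # Equation (5)
    big_b = variance(group_means)                    # Equation (6)
    u_bar = mean([mean(row) for row in u])           # Equation (7)
    t = (1 + 1 / m) * big_b - b_bar / r + u_bar      # Equation (8)
    inv_df = (((1 + 1 / m) * big_b) ** 2 / ((m - 1) * t ** 2)
              + (b_bar / r) ** 2 / (m * (r - 1) * t ** 2))
    return q_bar, t, 1 / inv_df                      # Equation (9)


# Made-up estimates with m = 3 nests and r = 2 datasets per nest.
q = [[10.0, 12.0], [13.0, 15.0], [11.0, 13.0]]
u = [[2.0, 2.0], [2.0, 2.0], [2.0, 2.0]]
q_bar_nested, t_nested, nu_nested = combine_nested(q, u)
```

For these inputs $B_M = 7/3$, $\bar{b}_M = 2$, and $\bar{u}_M = 2$, so $T_M = (4/3)(7/3) - 1 + 2 = 37/9$.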


The behavior of $T_M$ and $\nu_M$ in special cases is 
instructive. When $r$ is very large, $T_M \approx T_m$. This is because 
the $\bar{q}^{(l)} \approx q^{(l)}$, so that we obtain the results from analyzing 
the $D^{(l)}$. When the fraction of replaced values is small 
relative to the fraction of missing values, the $\bar{b}_M$ is small 
relative to $B_M$, so that once again $T_M \approx T_m$. In both these 
cases, the $\nu_M$ approximately equals $\nu_m$, which is Rubin’s 
(1987) degrees of freedom when imputing missing data 
only. When the fraction of missing values is small relative 
to the fraction of replaced values, the $B_M \approx \bar{b}_M / r$, so that 
$T_M$ is approximately equal to $T_p$ with $M$ released datasets. 
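
The last special case can be written out in one line of algebra. The worked step below only substitutes the stated approximation for $B_M$ into the definition of $T_M$; it adds no new result.

```latex
\begin{aligned}
T_M &= (1 + 1/m)\,B_M - \bar{b}_M / r + \bar{u}_M \\
    &\approx (1 + 1/m)\,\bar{b}_M / r - \bar{b}_M / r + \bar{u}_M \\
    &= \bar{b}_M / (mr) + \bar{u}_M \\
    &= \bar{b}_M / M + \bar{u}_M ,
\end{aligned}
```

which has the same form as $T_p = b_r / r + \bar{u}_r$ with the number of synthetic datasets $r$ replaced by $M = mr$.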


4. JUSTIFICATION OF NEW COMBINING RULES 


This section presents a Bayesian derivation of the 
inferences described in section 3 and describes conditions 
under which these inferences are valid from a frequentist 
perspective. These results make use of the theory developed 
in Rubin (1987) and Reiter (2003). For the Bayesian 
derivation, we assume that the analyst and imputer use the 
same models. 

Let $D^m = \{D^{(l)} : l = 1, \ldots, m\}$ be the collection of all 
multiply-imputed datasets before any observed values are 
replaced. For each $D^{(l)}$, let $q^{(l)}$ and $u^{(l)}$ be the posterior 
mean and variance of $Q$. As in Rubin (1987, Chapter 3), let 
$B_\infty$ be the variance of the $q^{(l)}$ obtained when $m = \infty$. 

Let $d^M = \{d_i^{(l)} : i = 1, \ldots, r;\ l = 1, \ldots, m\}$ be the collection 
of all released synthetic datasets. For each $d_i^{(l)}$, let $q_i^{(l)}$ be 
the posterior mean of $q^{(l)}$. For each $l$, let $B^{(l)}$ be the 
variance of the $q_i^{(l)}$ obtained when $r = \infty$. Lastly, let $B$ be 
the average of the $B^{(l)}$ obtained when $m = \infty$. 

Using these quantities, the posterior distribution for 
$(Q \mid d^M)$ can be decomposed as 


$$f(Q \mid d^M) = \int f(Q \mid d^M, D^m, B_\infty, B)\, f(D^m, B_\infty \mid d^M, B)\, f(B \mid d^M)\, dD^m\, dB_\infty\, dB. \qquad (10)$$


The integration is over the distributions of the values in $D$ 
that are missing and the values in each $D^{(l)}$ that are 
replaced with imputations; the observed, unaltered values 
remain fixed. We assume standard Bayesian asymptotics 
hold, so that complete-data inferences for Q can be based on 
normal distributions. 


4.1 Evaluating $f(Q \mid d^M, D^m, B_\infty, B)$ 


Given $D^m$, the synthetic data are irrelevant, so that 
$f(Q \mid d^M, D^m, B_\infty, B) = f(Q \mid D^m, B_\infty)$. This is the 
posterior distribution of $Q$ for multiple imputation for missing 
data, conditional on $B_\infty$. As shown by Rubin (1987), this 
posterior distribution is approximately 


$$(Q \mid D^m, B_\infty) \sim N(\bar{q}_m, (1 + 1/m) B_\infty + \bar{u}_m) \qquad (11)$$


where $\bar{q}_m$ and $\bar{u}_m$ are defined as in (1) and (3). In multiple 
imputation for missing data, we integrate (11) over the 
posterior distribution of $(B_\infty \mid D^m)$. This is not done here, 
since we integrate over $(B_\infty \mid d^M)$. 


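4.2 Evaluating $f(D^m, B_\infty \mid d^M, B)\, f(B \mid d^M)$ 


Since the distribution for $Q$ in (11) relies only on $\bar{q}_m$, 
$\bar{u}_m$, and $B_\infty$, it is sufficient for $f(D^m, B_\infty \mid d^M, B)$ to 
determine 


$$f(\bar{q}_m, \bar{u}_m, B_\infty \mid d^M, B) = f(\bar{q}_m, \bar{u}_m \mid d^M, B_\infty, B)\, f(B_\infty \mid d^M, B).$$


Following Reiter (2003), we first assume replacement 
imputations are made so that, for all $l$, the sampling 
distributions of each $q_i^{(l)}$ and $u_i^{(l)}$ are, 


$$(q_i^{(l)} \mid D^{(l)}, B^{(l)}) \sim N(q^{(l)}, B^{(l)}) \qquad (12)$$

$$(u_i^{(l)} \mid D^{(l)}, B^{(l)}) \sim (u^{(l)}, \ll B^{(l)}). \qquad (13)$$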


Here, the notation $F \sim (G, \ll H)$ means that the random 
variable $F$ has a distribution with expectation of $G$ and 
variability much less than $H$. In actuality, $u_i^{(l)}$ is typically 
centered at a value larger than $u^{(l)}$, since synthetic data 
incorporate uncertainty due to drawing values of the 
parameters. For large sample sizes $n$, this bias should be 
minimal. The assumption that $E(q_i^{(l)} \mid D^{(l)}, B^{(l)}) = q^{(l)}$ and 
the normality assumption should be reasonable when the 
imputations are drawn from correct posterior predictive 
distributions, $f(Y_{rep} \mid D^{(l)}, Z)$, and the usual asymptotics 
hold. 

Assuming flat priors for all $q^{(l)}$ and $u^{(l)}$, standard 
Bayesian theory implies that 


$$(q^{(l)} \mid d^M, B^{(l)}) \sim N(\bar{q}^{(l)}, B^{(l)} / r) \qquad (14)$$

$$(u^{(l)} \mid d^M, B^{(l)}) \sim (\bar{u}^{(l)}, \ll B^{(l)} / r) \qquad (15)$$

$$\left( \frac{(r-1)\, b^{(l)}}{B^{(l)}} \,\Big|\, d^M \right) \sim \chi^2_{r-1} \qquad (16)$$


where $b^{(l)}$ is defined in (5). We next assume that $B^{(l)} = B$ 
for all $l$. This should be reasonable, since the variability in 
posterior variances tends to be of smaller order than the 
variability of posterior means. Averaging across $l$, we obtain 


$$(\bar{q}_m \mid d^M, B) \sim N(\bar{q}_M, B / (rm)) \qquad (17)$$

$$(\bar{u}_m \mid d^M, B) \sim (\bar{u}_M, \ll B / (rm)) \qquad (18)$$


where $\bar{q}_M$ is defined in (4) and $\bar{u}_M$ is defined in (7). The 
posterior distribution of $(B_\infty \mid d^M, B)$ is 


$$\left( \frac{(m-1)\, B_M}{B_\infty + B/r} \,\Big|\, d^M, B \right) \sim \chi^2_{m-1} \qquad (19)$$


where $B_M$ is defined in (6). 
Finally, the posterior distribution of $(B \mid d^M)$ is 


$$\left( \frac{m(r-1)\, \bar{b}_M}{B} \,\Big|\, d^M \right) \sim \chi^2_{m(r-1)} \qquad (20)$$


where $\bar{b}_M$ is defined in (5). 


4.3 Evaluating $f(Q \mid d^M)$ 


We need to integrate the product of (11) and (17) with 
respect to the distributions in (19) and (20). This can be 
done by numerical integration, but it is desirable to have 
simpler approximations for users. 

For large $m$ and $r$, we can replace the terms in the 
variance with their approximate expectations: the 
$B_\infty \approx B_M - B/r$, and the $B \approx \bar{b}_M$. Hence, for large $m$ and $r$, the 
posterior distribution of $Q$ is approximately: 


$$(Q \mid d^M) \sim N(\bar{q}_M, (1 + 1/m)(B_M - \bar{b}_M / r) + \bar{b}_M / (mr) + \bar{u}_M)$$

$$= N(\bar{q}_M, (1 + 1/m) B_M - \bar{b}_M / r + \bar{u}_M)$$

$$= N(\bar{q}_M, T_M). \qquad (21)$$
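
The steps behind the large-sample variance can be spelled out as follows. This is a worked restatement using only (11), (17), (19), and (20), together with the fact that a chi-squared variable has mean equal to its degrees of freedom.

```latex
\begin{aligned}
&\text{From (19): } E\!\left[\frac{(m-1)\,B_M}{B_\infty + B/r}\right] = m - 1
  \;\Rightarrow\; B_\infty \approx B_M - B/r, \\
&\text{From (20): } E\!\left[\frac{m(r-1)\,\bar{b}_M}{B}\right] = m(r-1)
  \;\Rightarrow\; B \approx \bar{b}_M, \\
&\text{Variance in (11) plus variance in (17): } \\
&\quad (1 + 1/m)\,B_\infty + \bar{u}_m + B/(mr)
  \approx (1 + 1/m)(B_M - \bar{b}_M/r) + \bar{b}_M/(mr) + \bar{u}_M \\
&\quad = (1 + 1/m)\,B_M - \bar{b}_M/r - \bar{b}_M/(mr) + \bar{b}_M/(mr) + \bar{u}_M
  = T_M .
\end{aligned}
```

The two $\bar{b}_M/(mr)$ terms cancel, which is why $T_M$ contains a single subtraction of $\bar{b}_M / r$.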


When $m$ and $r$ are moderately sized, the normal 
distribution may not be a good approximation. To derive an 
approximate reference $t$-distribution, we use the strategies of 
Rubin (1987) and Barnard and Rubin (1999). That is, we 
assume that for some degrees of freedom $\nu_M$ to be 
estimated, 


$$\left( \frac{\nu_M T_M}{\bar{u}_M + (1 + 1/m) B_\infty + B/(mr)} \,\Big|\, d^M \right) \sim \chi^2_{\nu_M} \qquad (22)$$


so that we can use a $t$-distribution with $\nu_M$ degrees of 
freedom for inferences about $Q$. We approximate $\nu_M$ by 
matching the first two moments of (22) to those of a 
chi-squared distribution. The details showing that $\nu_M$ is 
approximated by the expression in (9) are provided in the 
appendix. 

The inferences based on (4)-(9) have valid frequentist 
properties under certain conditions. First, the analyst must 
use randomization-valid estimators, $q$ and $u$. That is, when $q$ 
and $u$ are applied on $D$ to get $q_{obs}$ and $u_{obs}$, the 
$(q_{obs} \mid X, Y) \sim N(Q, U)$ and $(u_{obs} \mid X, Y) \sim (U, \ll U)$, 
where the relevant distribution is that of $I$. Second, the 
imputations for missing data must be proper in the sense of 
Rubin (1987, Chapter 4). Essentially, this requires that 
inferences from the imputations for missing data be 
randomization-valid for $q_{obs}$ and $u_{obs}$, under the posited 
nonresponse mechanism. Third, the imputations for 
partially synthetic data must be synthetically proper in the 
sense of Reiter (2003). This requires that the inferences 
from the replacement imputations associated with each $D^{(l)}$ 
be randomization-valid for the $q^{(l)}$ and $u^{(l)}$. 

In general, it is difficult to verify that imputations for 
missing data are proper in complex samples (Binder and 
Sun 1996). They may be proper for some analyses but not 
for others. As a result, some confidence intervals centered 
on unbiased estimators may not have nominal coverage 
rates; see Meng (1994) for a discussion of this issue. These 
difficulties exist for the multiple imputation approach used 
here, and indeed may be compounded because of the 
additional imputation of synthetic data. 


5. CONCLUDING REMARKS 


There are many challenges to using partially synthetic 
data approaches for disclosure limitation. Most important, 
agencies must decide which values to replace with 
imputations. General candidates for replacement include the 
values of identifying characteristics for units that are at high 
risk of identification, such as sample uniques and duplicates, 
and the values of sensitive variables in the tails of 
distributions. Confidentiality can be protected further by, in 
addition, replacing values at low disclosure risk (Liu and 
Little 2002). This increases the variation in the replacement 
imputations, and it obscures any information that can be 
gained just from knowing which data were replaced. As 
with any disclosure limitation method (Duncan, Keller- 
McNulty and Stokes 2001), these decisions should consider 
tradeoffs between disclosure risk and data utility. Guidance 
on selecting values for replacement is a high priority for 
research in this area. 

There remain disclosure risks in partially synthetic data 
no matter which values are replaced. Users can utilize the 
released, unaltered values to facilitate disclosure attacks, for 
example via matching to external databases, or they may be 
able to estimate actual values of $Y_{obs}$ from the synthetic data 
with reasonable accuracy. For instance, if all people in a 
certain demographic group have the same, or even nearly 
the same, value of an outcome variable, the imputation 
models likely will generate that value for imputations. 
Imputers may need to coarsen the imputations for such 
people. As another example, when users know that a certain 
record has the largest value of some $Y_{obs}$, that record can be 
identified when its value is not replaced. 

On the data utility side, the main challenge is specifying 
imputation models, both for the missing and replaced data, 
that give valid results. For missing data, it is well known 
that implausible imputation models can produce invalid 
inferences, although this is less problematic when imputing 
relatively small fractions of missing data (Rubin 1987; 
Meng 1994). There is an analogous issue for partially 
synthetic data. When large fractions of data are replaced, for 
example entire variables, analyses involving the replaced 
values reflect primarily the distributional assumptions 
implicit in the imputation models. When these assumptions 
are implausible, the resulting analyses can be invalid. 
Again, this is less problematic when only small fractions of 
values are replaced, as might be expected in many 
applications of the partially synthetic approach. 

Certain data characteristics can be especially challenging 
to handle with partially synthetic data. For example, it may 
be desirable to replace extreme values in skewed 
distributions, such as very large incomes. Information about 
the tails of these distributions may be limited, making it 
difficult to draw reasonable replacements while protecting 
confidentiality. As another example, randomly drawn 
imputations for highly structured data may be implausible, 
for instance unlikely combinations of family members’ ages 
or marital statuses. These difficulties, coupled with the 
general limitations of inferences based on imputations, point 
to an important issue for research: developing and 
evaluating methods for generating partially synthetic data, 
including semi-parametric and non-parametric approaches. 

We note that building the synthetic data models is 
generally an easier task than building the missing data 
models. Agencies can compare the distributions of the 
synthetic data to those of the observed data being replaced. 
When the synthetic distributions are too dissimilar from the 
observed ones, the imputation models can be adjusted. 
There usually is no such check for the missing data models. 
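
One simple way an agency might carry out such a comparison is to look at matching quantiles of the replaced observed values and their synthetic replacements. The sketch below is only an assumption about how the check could be done; the paper does not prescribe a diagnostic, and the function name and income figures are hypothetical.

```python
from statistics import quantiles


def quantile_gaps(observed, synthetic, n=4):
    """Absolute gaps between matching quantiles of two samples.

    Illustrative diagnostic (hypothetical name): large gaps suggest
    the synthetic values do not resemble the values they replace.
    """
    obs_q = quantiles(observed, n=n)
    syn_q = quantiles(synthetic, n=n)
    return [abs(a - b) for a, b in zip(obs_q, syn_q)]


# Made-up replaced incomes (in thousands) and synthetic replacements.
replaced = [105.0, 120.0, 150.0, 210.0, 400.0]
synthetic = [110.0, 118.0, 160.0, 190.0, 380.0]
gaps = quantile_gaps(replaced, synthetic)
```

With `n=4` the function returns three gaps, one for each quartile cut point; identical samples give gaps of zero.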

It is, of course, impossible for agencies to anticipate 
every possible use of the released data, and hence 
impossible to generate models that provide valid results for 
every analysis. A more modest and attainable goal is to 
enable analysts to obtain valid inferences using standard 
methods and software for a wide range of standard analyses, 
such as some linear and logistic regressions. Agencies 
therefore should provide information that helps analysts 
decide what inferences can be supported by the released 
data. For example, agencies can include descriptions of the 
imputation models as attachments to public releases of data. 
Users whose analyses are not supported by the data may 
have to apply for special access to the observed data. 
Agencies also need to provide documentation for how to use 
the nested data sets. Rules for combining point estimates 
from the multiple data sets are simple enough to be added to 
standard statistical software packages, as has been done 
already for Rubin’s (1987) rules in SAS, Stata, and S-Plus. 

As constructed, the multiple imputation approach does 
not calibrate to published totals. This could leave some 
users dissatisfied with, or distrustful of, the released data. It 
is not clear how to adapt the method (or, for that matter, many 
other disclosure limitation techniques that alter the original 
data) for calibration. 

Missing data and disclosure risk are major issues 
confronting organizations releasing data to the public. The 
multiple imputation approach presented here is suited to 
handle both simultaneously, providing users with 
rectangular completed datasets that can be analyzed with 
standard statistical methods and software. There are 
challenges to implementing this approach in genuine 
applications, but, as noted by Rubin (1993) in his initial 
proposal, the potential payoffs of this use of multiple 
imputation are high. The next item on the research agenda is 
to investigate how well the theory works in practice, 
including comparisons of this approach with other 
disclosure limitation methods. These comparisons should 
focus on measures of disclosure risks, obtained by 
simulating intruder behavior, and on measures of data utility for 
estimands of interest to users, including properties of point 
and interval estimates. 


APPENDIX: DERIVATION OF APPROXIMATE 
DEGREES OF FREEDOM 


Inferences from datasets with multiple imputations for 
both missing data and partially synthetic replacements are 
made using a $t$-distribution. A key step is to approximate the 
distribution of 


$$\frac{\nu_M T_M}{\bar{u}_M + (1 + 1/m) B_\infty + B/(mr)} \qquad (23)$$


as a chi-squared distribution with $\nu_M$ degrees of freedom. 
The $\nu_M$ is determined by matching the mean and variance 
of the inverted $\chi^2$ distribution to the mean and variance of 
(23). 

Let $\alpha = (B_\infty + B/r) / B_M$ and let $\gamma = B / \bar{b}_M$. Then, 
$(\alpha^{-1} \mid d^M, B)$ and $(\gamma^{-1} \mid d^M)$ have mean square 
distributions with degrees of freedom $m - 1$ and $m(r-1)$, 
respectively. Let $f = (1 + 1/m) B_M / \bar{u}_M$, and let 
$g = (1/r)\, \bar{b}_M / \bar{u}_M$. Up to the constant factor $\nu_M$, we can 
write (23) as 


$$\frac{T_M}{\bar{u}_M + (1 + 1/m) B_\infty + B/(mr)} = \frac{\bar{u}_M (1 + f - g)}{\bar{u}_M (1 + \alpha f - \gamma g)}. \qquad (24)$$

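To match moments, we need to approximate the expectation 
and variance of (24). 
For the expectation, we use the fact that 


$$E\left( \frac{1 + f - g}{1 + \alpha f - \gamma g} \,\Big|\, d^M \right) = E\left[ E\left( \frac{1 + f - g}{1 + \alpha f - \gamma g} \,\Big|\, d^M, B \right) \Big|\, d^M \right]. \qquad (25)$$

We approximate these expectations using first order Taylor 
series expansion in $\alpha^{-1}$ and $\gamma^{-1}$ around their expectations, 
which equal one. As a result, 


$$E\left[ E\left( \frac{1 + f - g}{1 + \alpha f - \gamma g} \,\Big|\, d^M, B \right) \Big|\, d^M \right] \approx E\left( \frac{1 + f - g}{1 + f - \gamma g} \,\Big|\, d^M \right) \approx 1. \qquad (26)$$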


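For the variance, we use the conditional variance 
representation 


$$\mathrm{Var}\left( \frac{1 + f - g}{1 + \alpha f - \gamma g} \,\Big|\, d^M \right) = E\left[ \mathrm{Var}\left( \frac{1 + f - g}{1 + \alpha f - \gamma g} \,\Big|\, d^M, B \right) \Big|\, d^M \right] + \mathrm{Var}\left[ E\left( \frac{1 + f - g}{1 + \alpha f - \gamma g} \,\Big|\, d^M, B \right) \Big|\, d^M \right]. \qquad (27)$$


For the interior variance and expectation, we use a first 
order Taylor series expansion in $\alpha^{-1}$ around its expectation. 
Since $\mathrm{Var}(\alpha^{-1} \mid d^M, B) = 2/(m-1)$, the expression in (27) 
equals approximately 


$$E\left[ \frac{2 f^2 (1 + f - g)^2}{(m-1)(1 + f - \gamma g)^4} \,\Big|\, d^M \right] + \mathrm{Var}\left( \frac{1 + f - g}{1 + f - \gamma g} \,\Big|\, d^M \right). \qquad (28)$$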

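We now use first order Taylor series expansions in $\gamma^{-1}$ 
around its expectation to determine the components of (28). 
The first term in (28) is, 


$$E\left[ \frac{2 f^2 (1 + f - g)^2}{(m-1)(1 + f - \gamma g)^4} \,\Big|\, d^M \right] \approx \frac{2 f^2}{(m-1)(1 + f - g)^2}. \qquad (29)$$

Since $\mathrm{Var}(\gamma^{-1} \mid d^M) = 2/(m(r-1))$, the second term in 
(28) is 


$$\mathrm{Var}\left( \frac{1 + f - g}{1 + f - \gamma g} \,\Big|\, d^M \right) \approx \frac{2 g^2}{m(r-1)(1 + f - g)^2}. \qquad (30)$$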

Combining (29) and (30), the variance of (23) equals 
approximately 


$$\frac{2 f^2}{(m-1)(1 + f - g)^2} + \frac{2 g^2}{m(r-1)(1 + f - g)^2}. \qquad (31)$$


Since a mean square random variable has variance equal to 
2 divided by its degrees of freedom, we conclude that 


$$\nu_M = \left( \frac{f^2}{(m-1)(1 + f - g)^2} + \frac{g^2}{m(r-1)(1 + f - g)^2} \right)^{-1}. \qquad (32)$$
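
As a quick numerical sanity check, the sketch below evaluates the degrees of freedom in the $(f, g)$ form of (32) and in the form of (9), using $T_M = \bar{u}_M (1 + f - g)$. The function names and input values are hypothetical; the check only confirms that the two expressions agree for the chosen numbers.

```python
def df_from_eq9(m, r, big_b, b_bar, u_bar):
    """Degrees of freedom written in terms of B_M, b-bar_M, and T_M."""
    t = (1 + 1 / m) * big_b - b_bar / r + u_bar
    return 1 / (((1 + 1 / m) * big_b) ** 2 / ((m - 1) * t ** 2)
                + (b_bar / r) ** 2 / (m * (r - 1) * t ** 2))


def df_from_eq32(m, r, big_b, b_bar, u_bar):
    """Degrees of freedom written in terms of f and g."""
    f = (1 + 1 / m) * big_b / u_bar
    g = (b_bar / r) / u_bar
    return 1 / (f ** 2 / ((m - 1) * (1 + f - g) ** 2)
                + g ** 2 / (m * (r - 1) * (1 + f - g) ** 2))


# Made-up values: m = 3, r = 2, B_M = 7/3, b-bar_M = 2, u-bar_M = 2.
nu_a = df_from_eq9(3, 2, 7 / 3, 2.0, 2.0)
nu_b = df_from_eq32(3, 2, 7 / 3, 2.0, 2.0)
```

Both functions return $1369/419$ for these inputs, as expected from the identities $f / (1 + f - g) = (1 + 1/m) B_M / T_M$ and $g / (1 + f - g) = (\bar{b}_M / r) / T_M$.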


Erratum: 


In the June 2004 issue, we published a paper by D.N. Da Silva and Jean D. Opsomer on “Properties of the Weighting Cell 
Estimator Under a Nonparametric Response Mechanism” (pages 45-55). We would like to apologize for having incorrectly 
spelled out Dr. Da Silva’s name. It should have read D. Nobrega Da Silva. Please note also that the corrected version appears 
on Statistics Canada’s Web site. 


GUIDELINES FOR MANUSCRIPTS 


Before having a manuscript typed for submission, please examine a recent issue of Survey Methodology (Vol. 19, No. 1 and 
onward) as a guide and note particularly the points below. Articles must be submitted in machine-readable form, preferably 
in Word. A paper copy may be required for formulas and figures. 


Layout 


Manuscripts should be typed on white bond paper of standard size (8½ × 11 inch), one side only, entirely double 
spaced with margins of at least 1½ inches on all sides. 

The manuscripts should be divided into numbered sections with suitable verbal titles. 

The name and address of each author should be given as a footnote on the first page of the manuscript. 
Acknowledgements should appear at the end of the text. 

Any appendix should be placed after the acknowledgements but before the list of references. 


Abstract 


The manuscript should begin with an abstract consisting of one paragraph followed by three to six key words. Avoid 
mathematical expressions in the abstract. 


Style 


Avoid footnotes, abbreviations, and acronyms. 

Mathematical symbols will be italicized unless specified otherwise except for functional symbols such as “exp(·)” 
and “log(·)”, etc. 

Short formulae should be left in the text but everything in the text should fit in single spacing. Long and important 
equations should be separated from the text and numbered consecutively with arabic numerals on the right if they are 
to be referred to later. 

Write fractions in the text using a solidus. 

Distinguish between ambiguous characters, (e.g., w, ω; o, O, 0; l, 1). 

Italics are used for emphasis. Indicate italics by underlining on the manuscript. 


Figures and Tables 


All figures and tables should be numbered consecutively with arabic numerals, with titles which are as nearly self 
explanatory as possible, at the bottom for figures and at the top for tables. 

They should be put on separate pages with an indication of their appropriate placement in the text. (Normally they 
should appear near where they are first referred to). 


References 


References in the text should be cited with authors’ names and the date of publication. If part of a reference is cited, 
indicate after the reference, e.g., Cochran (1977, p. 164). 

The list of references at the end of the manuscript should be arranged alphabetically and for the same author 
chronologically. Distinguish publications of the same author in the same year by attaching a, b, c to the year of 
publication. Journal titles should not be abbreviated. Follow the same format used in recent issues. 